ABSTRACT


A VLSI architecture for computing the discrete Fourier transform (DFT) using the Winograd
Fourier transform algorithm (WFTA) is presented. This architecture is an addressless,
routed, bit-serial scheme that directly maps an N-point algorithm onto silicon. The
architecture appears to be far less costly than systolic schemes for implementing the
WFTA, and faster than current FFT devices for similar transform sizes. The nesting
method of Winograd is used for partitioning larger transformations into several circuits.
The advantage of this partitioning technique is that it allows using circuits that are all of
the same type. However, the number of input/output pins of each circuit is higher than
with some other approaches such as the prime factor algorithm. The design of
a 20-point DFT circuit with logic diagrams of its major cells is presented. The gate array
circuit has been sent for fabrication in a 0.7 µm CMOS technology. Five circuits
interconnected together will compute 60-point complex transforms at a rate of one
transformation every 0.53 µs.




EXECUTIVE  SUMMARY 


Electronic  warfare  systems  rely  more  and  more  on  the  development  of  digital  processing 
to  increase  their  signal  handling  capability.  This  trend  stems  in  good  part  from  the 
convenience  and  low  cost  of  semiconductor  devices  and  the  emergence  of  very  large  scale 
integration  (VLSI)  systems.  The  object  of  this  research  is  to  investigate  new  means  of 
computing  the  discrete  Fourier  transform  (DFT)  at  very  high  speed  using  VLSI  circuits. 
The discrete Fourier transformation is a widely used algorithm for switching between the
time and frequency representations of sampled waveforms.

The performance of DFT circuits and boards is determined by the transformation
algorithm and by the architecture used to implement the algorithm. In commercial
products, the algorithm and architecture are chosen for their flexibility, in an attempt to
facilitate many applications. Practically all commercial devices use the FFT algorithm,
which allows varying the transform size N across a wide range of values and implementing
the division by N of the inverse transformation with an inexpensive bit shift.

Unfortunately, commercial DFT circuits and boards do not deliver the throughputs
that are needed in many electronic warfare applications. Higher throughputs can be
obtained by using several DFT processors in parallel, but this generally leads to
complicated and expensive implementations, which are limited by the need for multiplexers and
demultiplexers, increased bulk, lower reliability, and higher power consumption.

This report presents a DFT architecture aimed at applications where DFTs
must be computed at very high speeds, and where the number of points N is fixed and
not too large, typically a few hundred or less. This architecture is not based on the
FFT. Instead it uses an algorithm that was invented by Winograd in 1976. The Winograd
Fourier transform algorithm (WFTA) computes the same transformation as the FFT, and
uses fewer multiplications. In the proposed architecture, an N-point WFTA is mapped
directly onto a VLSI circuit using an addressless, bit-serial scheme. The smaller number
of multiplications yields silicon area savings which can be traded for a higher throughput
or a larger transform size N. Due to the complicated indexing scheme of the WFTA,
the layout requires substantial routing between its arithmetic cells and the architecture
is called routed.

If a layout turns out to be too large to fit on a single circuit, the architecture
can be partitioned into several identical circuits using the nesting method. The nesting
method was also invented by Winograd. As higher length algorithms are constructed,
the nesting method uses fewer multiplications than other construction algorithms, including
the prime factor algorithm.


To  validate  the  routed  architecture  and  the  partitioning  strategy,  a  VLSI  circuit 
has  been  designed  at  DREO  and  sent  to  a  silicon  foundry.  The  CMOS  circuit  contains 
55  000  gates  and  can  compute  by  itself  20-point  complex  DFTs.  The  nesting  method  allows 
the  interconnection  of  five  circuits  to  compute  60-point  complex  DFTs.  Assuming  16-bit 
input  samples,  the  predicted  speed  of  1.8  million  transformations  per  second  is  about 
three  to  ten  times  higher  than  that  of  commercial  chip  sets*.  In  the  prototype  circuit,  the 
adders  have  been  organized  in  layers  and  interconnected  by  software.  The  48  multipliers 
have  been  carefully  designed  to  minimize  their  gate  count  without  compromising  their 
speed  and  accuracy.  The  circuit  can  accept  samples  of  any  precision  in  fixed-point  two's 
complement  format,  and  output  coefficients  with  up  to  10  bits  of  accuracy. 

The Air Force Institute of Technology in Dayton, Ohio, is also developing
Winograd Fourier transform circuits. At this time, 15-, 16-, and 17-point DFT circuits
are being designed and tested. The three full-custom circuits are slightly slower than the
DREO gate array, but they are more accurate because their multipliers have more stages.
The three circuits are meant to be interconnected using the prime factor algorithm to
form part of a 4080-point DFT machine†.

The WFTA and the routed architecture are not without disadvantages. First, as
the transform size is increased, the routing gradually grows and may become impractical
to handle. Second, the complex indexing scheme of the WFTA restrains the flexibility
with respect to N. Lastly, the WFTA favors using values of N that are not powers of 2;
hence the division by N appearing in the inverse transformation does not reduce to a bit
shift as it does in the FFT.

The WFTA should not be viewed as a replacement for the FFT, but rather
as a complementary algorithm with its own advantages and inconveniences, which may
find  use  in  different  applications.  The  WFTA  and  the  routed  architecture  are  attractive 
for  applications  requiring  high  throughput,  cost  effective  DFT  computation  for  moderate 
transform  sizes.  For  instance,  the  routed  architecture  is  currently  being  considered  at 
DREO  for  processing  radar  pulses  in  real-time  upon  interception  by  radar  electronic 
support  measures  (ESM)  systems. 


* The fastest available chip set, to the authors' knowledge, is manufactured by Honeywell and consists
of 12 GaAs circuits.

† The authors wish to thank Mark A. Mehalic, AFIT, for the information provided.
1.0 INTRODUCTION


The introduction by Winograd [1],[2], and Agarwal and Cooley [3], of new, short length
discrete Fourier transformation algorithms requiring fewer multiplications than the fast
Fourier transform [4],[5] stirred interest in the signal processing community. In
combination with his high speed algorithms, Winograd proposed a nesting method for constructing
algorithms of higher lengths. The algorithms obtained by means of this nesting method
are known as Winograd Fourier transform algorithms (WFTA) [6].

Another method, which is based on the Good-Thomas prime factor algorithm
(PFA) [7],[8], has been proposed by Kolba and Parks [9] for computing long discrete
Fourier transforms with Winograd's short length algorithms. The PFA and nesting
methods can be combined, making it possible to obtain in-place and in-order algorithms [10].
However, only the nesting method of Winograd is considered in this report, mostly because
it minimizes the number of multiplications.

Apart from their theoretical value, which was immediately recognized, Winograd's
algorithms have found very few applications since their introduction. One of the
underlying difficulties with these algorithms is that their additions are nested in a
complicated and irregular manner. Early results showed that WFTA software sometimes runs
faster, and sometimes slower, than the FFT on computers like the IBM 370 [6],[9],[11]-[14]. Various
hardware architectures for the WFTA and its variants have been proposed, but, to the
authors' knowledge, none has been demonstrated using a complete prototype. As a result,
the FFT is still considered the algorithm of choice in the practical world.

The widespread perception that Winograd's algorithms do not lend themselves
well to either hardware or software realizations is now being challenged. The change stems
from the emergence of new computer architectures, higher component densities on VLSI
circuits, and more powerful compilers and computer-aided design tools. For instance, Lu,
Cooley and Tolimieri have recently shown [15] that variants of the WFTA can execute
more efficiently than the FFT on RISC computers having a "floating-point multiply-add"
feature. Aloisio et al. have implemented the PFA on hypercube computers [16]. At this
time, the Air Force Institute of Technology is developing 15-, 16-, and 17-point WFTA
integrated circuits [17],[18]¹.

The argument for using the WFTA instead of the FFT in high-speed VLSI
realizations is simple. Since the WFTA requires fewer multiplications than the FFT, and


¹ After work on the present report was well under way, the authors became acquainted with the AFIT
project and were pleasantly surprised by the similarities between the AFIT and DREO circuits.


multipliers in VLSI are very expensive², the WFTA should yield smaller, more
cost-effective VLSI realizations. Figure 1 shows the minimum number of multiplications in the
Figure 1: Number of non-trivial real multiplications in the FFT and Winograd algorithms
as a function of the number of points N.

WFTA and FFT³ as a function of the number of points N. It is easily verified that the
WFTA requires two to three times fewer multiplications than the FFT for N > 60. The
difference increases with N, as the WFTA requires a number of multiplications
proportional to N, while for the FFT the proportionality is to N log(N) [20]. The number of
additions remains approximately the same.
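The FFT counts in Fig. 1 assume that each complex multiplication is carried out with three real multiplications and three real additions. The following Python sketch shows one common way of doing this when one factor is a fixed constant; it is an illustration only and is not necessarily the exact variant given in [19]. The function name `complex_multiply_3` and the intermediate names are hypothetical.

```python
def complex_multiply_3(a, b, c, d):
    """Return the real and imaginary parts of (a + ib)(c + id).

    The factor c + id is treated as a fixed constant (e.g. a twiddle
    factor), so c + d and d - c could be precomputed and stored. Only
    three multiplications and three additions then remain at run time.
    """
    c_plus_d = c + d       # precomputable when (c, d) is constant
    d_minus_c = d - c      # precomputable when (c, d) is constant
    k1 = c * (a + b)       # multiplication 1, addition 1
    k2 = a * d_minus_c     # multiplication 2
    k3 = b * c_plus_d      # multiplication 3
    real = k1 - k3         # addition 2: ac - bd
    imag = k1 + k2         # addition 3: ad + bc
    return real, imag


if __name__ == "__main__":
    real, imag = complex_multiply_3(3.0, -2.0, 0.5, 4.0)
    reference = complex(3.0, -2.0) * complex(0.5, 4.0)
    print(real, imag, reference)
```

The saving comes from sharing the product `k1` between the real and imaginary parts, at the cost of storing two derived constants per fixed factor.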

In this report, we examine the implementation of the WFTA in VLSI form for the
high-speed calculation of moderate length (less than a few hundred points) discrete Fourier
transforms. We propose a new VLSI architecture with detailed designs of its different
hardware cells. To put this in perspective, a quick review of some of the architectures
proposed in the past is useful. In 1980, Zohar [21] proposed running the WFTA on a
dedicated, address-based machine with one multiplier and two adders. In 1982, Ward and

² A CMOS multiplier of length b bits typically contains about 2b times more gates than an adder.

³ In Fig. 1, the number of arithmetic operations in the radix-2 FFT has been reduced by exploiting
the symmetries in the sine and cosine functions, and by implementing the complex multiplications with
three real multiplications and three real additions. The complex multiplication algorithm with three real
multiplications can be found in [19, Sect. 3.7.2].


Stanier [22] designed a systolic architecture for the WFTA; such architectures produce
regular layouts and allow very high clock rates [23],[24]. Then, in 1983, MacLeod and
Bragg [25] suggested directly mapping the algorithm's data flow onto hardware, using
bit-serial arithmetic. In 1985, Costello [26] compared the PFA to other techniques for
radar beam forming and concluded that the former was much cheaper to implement with
dedicated hardware. At about the same time, Ward, McCanny, and McWhirter [27],[28],
and shortly after, Owens and Ja'Ja [29], introduced more systolic architectures for the
WFTA. Lastly, in 1988, Linderman et al. [18] presented the design of three full-custom
WFTA circuits destined for a 4080-point PFA realization. The operations in the circuits
are carried out bit-serially, while the data transfers between the circuits and the main
memory are bit-parallel.

Pursuing the idea of MacLeod and Bragg, we propose an addressless, bit-serial
architecture that directly maps the WFTA of interest onto a VLSI circuit. Probing further,
we examine in detail the hardware cells and their interconnections, and actually provide
the specifications of a 20-point WFT circuit that has been sent for fabrication in a 0.7 µm
"gate array" CMOS technology. We found that for implementing the WFTA, an approach
like MacLeod and Bragg's yields the same performance as a systolic architecture, but at a
lower cost [30]. A 20-point WFT circuit, for example, contains about 47 000 gates and fits
on a moderately large gate array. By comparison, the systolic architecture of Ward et al.
would require 300 000 gates⁴ and a much larger die size. Hence, the layout of our circuit
ends up being more compact, despite some irregular portions having complicated routing.
From a design effort standpoint, the 20-point WFT circuit's schematics were manually
entered in the chip manufacturer's design system in seven man-weeks. Interconnecting
the adders took only a small portion of that time. As early as next year, some chip
manufacturers will add to their design software sets a "logic synthesis" tool that will
directly read logic specifications, and eliminate the error-prone and often tedious task of
drawing the schematics⁵. This will make routed architectures more attractive in general.

The novelty of the proposed architecture lies at the system level, where the
WFT circuits exchange data for computing discrete Fourier transforms of higher lengths.
Instead of relying on the standard PFA for partitioning the transformation, we adopted
Winograd's nesting method. This allowed us to design a 20-point WFT circuit such that
five devices assembled together can compute 60-point transforms⁶. Including this feature


⁴ See Section 7.0 for gate count equations.

⁵ The authors are grateful to A. Boubguira, LSI Logic Co. of Canada, for the information provided.

⁶ Another possibility, in fact, would be to use a single device five times in succession. This discussion
ignores the possible reduction in hardware which can be obtained for lower transformation rates.
in the 20-point WFT circuit increased the gate count by 16%⁷ and added 80 pins to the
package.  The  advantage  of  this  approach  is  that  it  can  be  implemented  with  circuits 
all  of  the  same  type.  Also,  the  number  of  multipliers  that  are  used  is  always  kept  to  a 
minimum,  thus  as  higher  density  processes  become  available,  a  multi-circuit  configuration 
can  be  directly  combined  to  fit  on  a  single  integrated  circuit.  The  disadvantage  of  the 
approach  is  that  it  requires  more  pins  than  the  PFA  for  inter-circuit  data  exchanges. 
This  is  the  price  for  minimizing  the  number  of  multipliers  in  the  data  path  and  for  using 
circuits  that  are  identical. 


Table 1: Comparison of the proposed WFTA architecture to commercial FFT devices.

Device(s)                               Circuit   Clock         N     Throughput      Figure
                                        Count     Rate                (samples/s)     of Merit

L64280/81 (LSI Logic)* [31]             3         40 MHz        64    4.3 × 10^6      1.4
A41102 (Austek Microsystems) [32]       1         40 MHz        64    2.5 × 10^6      2.5
HFFP (Honeywell)†                       12        250 MHz       64    41.7 × 10^6     3.5
a66110/210 (Array Microsystems) [33]    2         40 MHz        64    13.1 × 10^6     6.6
PDSP16510 (Plessey Semicond.) [34]      1         40 MHz        64    16.4 × 10^6     16.4
WFT circuit (DREO)                      5         [illegible]   60    [illegible]     22.2

* This is a floating point chip set. All the others are fixed point.

† This is a GaAs chip set (advance information 10/91). All the others are CMOS.


For comparing the proposed WFTA architecture to current FFT schemes, we
use four FFT chip sets that are commercially available⁸. It is assumed that the discrete
Fourier transformations are on complex data. Table 1 gives the number of circuits in each
set, the clock rate, and the throughput rate in complex samples/s for a transformation
of length N. Also shown is a simple but intuitive "figure of merit" obtained by dividing
the throughput by the number of circuits. The higher the figure of merit is, the better.
Of the FFT circuits, the PDSP16510 by Plessey Semiconductors is the only one that has
a figure of merit close to that of the WFT circuit. However, the PDSP16510 contains
at least twice as many gates as the WFT circuit, and costs about four times as much.
Compensating for silicon area and speed discrepancies would increase the figure of merit
of the WFT circuit to 75, i.e. about four times the value of the top FFT device⁹.

⁷ The gate count of the circuit therefore adds up to 47 000 + 8 000 = 55 000 gates.

⁸ The FFT can also be computed on digital signal processors [35], but at slower speeds.

⁹ To obtain the higher figure of merit, one could fit the 175 000 gates required by the 60-point WFTA
on two larger or higher-density circuits and raise their clock rate to 40 MHz.
On the other hand, the FFT devices offer more flexibility with regard to the number of
points of the transformation. The WFT circuit is limited to two transformation sizes: 20
and 60 points¹⁰. This comparison illustrates well that the FFT and WFTA offer different
advantages and limitations, and are therefore suited to different applications.
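The figure of merit of Table 1 is simply the throughput divided by the number of circuits. The short Python sketch below recomputes it for the FFT devices of the table, assuming the throughputs are expressed in units of 10^6 complex samples/s (a reading that reproduces the published figures). The names `TABLE_1` and `figure_of_merit` are hypothetical and introduced for illustration only.

```python
# (device, circuit count, throughput in complex samples/s, published figure of merit)
TABLE_1 = [
    ("L64280/81", 3, 4.3e6, 1.4),
    ("A41102", 1, 2.5e6, 2.5),
    ("HFFP", 12, 41.7e6, 3.5),
    ("a66110/210", 2, 13.1e6, 6.6),
    ("PDSP16510", 1, 16.4e6, 16.4),
]


def figure_of_merit(throughput, circuits):
    """Throughput per circuit, in millions of complex samples per second."""
    return throughput / circuits / 1e6


if __name__ == "__main__":
    for name, circuits, throughput, published in TABLE_1:
        computed = figure_of_merit(throughput, circuits)
        print(f"{name:12s} computed {computed:6.2f}   published {published:5.1f}")
```

The measure deliberately ignores die size, technology and cost per circuit, which is why the text goes on to discuss compensating for silicon area and clock rate.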

This  report  is  addressed  to  scientists  who  are  studying  the  high-speed  calculation 
of  the  discrete  Fourier  transform,  to  engineers  who  design  hardware  for  that  computation, 
i.e.  VLSI  circuits,  and  possibly  to  users  of  this  hardware.  No  specific  mathematical 
background  is  required.  The  hardware  descriptions  are  very  detailed,  mostly  because  the 
only  way  to  obtain  accurate  gate  counts,  and  cost  estimates,  is  by  unfolding  a  complete 
logic  design.  It  is  our  hope  that  the  logic  cells  presented  here  will  be  helpful  in  other 
bit-serial  circuits.  S.  Martineau  did  most  of  the  cell  design  work.  P.  Lavoie  proposed  the 
VLSI architecture and compared it to other schemes from speed and cost standpoints.

The report is organized as follows. In the next section, the discrete Fourier
transformation, Winograd's short length algorithms and the systolic architecture of Ward,
McCanny, and McWhirter are briefly reviewed. The routed architecture is introduced in
Section 3.0 using a 5-point transformation example. Then, in Section 4.0, the logic design
of a 20-point WFTA circuit based on the routed architecture is presented in detail. A
technique for laying out and interconnecting the adders is proposed. Multipliers with small
gate counts are introduced. The partitioning of a higher length 60-point algorithm in five
circuits is explained. This partitioning follows a novel approach based on Winograd's
nesting method. Internal scaling of the data to prevent overflows is included in all the
cells. The section ends with a recapitulation of the various cells that are required and
their gate counts. In Section 5.0, computer simulations of the 20-point WFTA circuit are
presented. The accuracy of the Fourier coefficients produced by the circuit is measured.
Practical considerations like the testability, clock speed and pin requirement are examined
in Section 6.0. Lastly, in Section 7.0, the routed architecture is compared to the systolic
architecture of Ward et al. and to a straightforward FFT design from a cost point of view.
The 20- and 60-point WFTA algorithms are derived in Appendix A. The twiddle factors
that must be stored into the WFT circuit are listed in Appendix B. The logic symbols
appearing in the figures are described in Appendix C.

¹⁰ In many high speed applications this is not much of an inconvenience since the number of points N
is fixed.
2.0 WARD, McCANNY AND McWHIRTER'S SYSTOLIC ARCHITECTURE

2.1  THE  DISCRETE  FOURIER  TRANSFORMATION 

The N-point discrete Fourier transform synthesis equation is

$$A_k = \sum_{n=0}^{N-1} a_n W^{nk}, \qquad k = 0, 1, \ldots, N-1, \tag{1}$$

with

$$W = e^{-i 2\pi / N}, \tag{2}$$

where $\{A_0, A_1, \ldots, A_{N-1}\}$ denotes the discrete Fourier transform (DFT) of a sequence
of N evenly-spaced, possibly complex samples $\{a_0, a_1, \ldots, a_{N-1}\}$. The operation of
computing the DFT of a sequence is called the discrete Fourier transformation. The original
sequence can be recovered from its DFT by the analysis equation

$$a_n = \frac{1}{N} \sum_{k=0}^{N-1} A_k W^{-nk}, \qquad n = 0, 1, \ldots, N-1. \tag{3}$$

This operation is called the inverse discrete Fourier transformation. It is very similar in
form to the discrete Fourier transformation.

The inverse discrete Fourier transformation can be implemented using a forward
DFT device by reversing the order of the outputs 1 through N − 1 and dividing their
value by N. If $N = q^r$ and numbers are represented in q-ary digits, then the division by
N reduces to shifting the point r positions. When using an FFT, N is usually a power of
two, and inverse transforms are thus easily computed. Winograd algorithms for $N = 2^r$
have been derived [36]-[38], but they require more multiplications and additions than the
FFT for N > 32.
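The use of a forward DFT device for the inverse transformation can be sketched in a few lines of Python. This is an illustration under stated assumptions: the forward transform is evaluated directly from its defining sum with the kernel W = exp(−i2π/N), and the function names `dft` and `inverse_dft_using_forward` are hypothetical. The reordering trick itself does not depend on the sign chosen for the exponent.

```python
import cmath


def dft(samples):
    """Forward DFT evaluated directly from its defining sum (O(N^2))."""
    n_points = len(samples)
    w = cmath.exp(-2j * cmath.pi / n_points)
    return [
        sum(a * w ** (n * k) for n, a in enumerate(samples))
        for k in range(n_points)
    ]


def inverse_dft_using_forward(coefficients):
    """Inverse DFT obtained from the forward transform only.

    The forward transform is applied to the coefficients, outputs
    1 .. N-1 are read in reverse order, and every value is divided by N.
    """
    n_points = len(coefficients)
    forward = dft(coefficients)
    reordered = [forward[0]] + forward[:0:-1]   # keep output 0, reverse 1 .. N-1
    return [value / n_points for value in reordered]


if __name__ == "__main__":
    samples = [1 + 2j, -3, 0.5j, 4, -1 - 1j, 2]
    recovered = inverse_dft_using_forward(dft(samples))
    print(max(abs(x - y) for x, y in zip(samples, recovered)))
```

Here N = 6 is not a power of two, so the final division is a true division rather than a bit shift, which is the situation the WFTA typically faces.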

2.2 WINOGRAD'S SHORT LENGTH DISCRETE FOURIER TRANSFORMATION ALGORITHMS

Each of the short length discrete Fourier transformation algorithms introduced by
Winograd consists of a sequence of additions, followed by multiplications, and by more
additions. Winograd has given algorithms for 2-, 3-, 4-, 5-, 7-, 8-, 9-, and 16-point DFTs, and
algorithms for other lengths can be found in [36]-[40].

For example, the 5-point algorithm producing the DFT $\{A_0, A_1, A_2, A_3, A_4\}$ of
an input sequence $\{a_0, a_1, a_2, a_3, a_4\}$ consists of the following operations:

Additions:

$$\begin{aligned}
s_1 &= a_1 + a_4 & s_2 &= a_1 - a_4 & s_3 &= a_3 + a_2 & s_4 &= a_3 - a_2 \\
s_5 &= s_1 + s_3 & s_6 &= s_1 - s_3 & s_7 &= s_2 + s_4 & s_8 &= s_5 + a_0
\end{aligned}$$

Multiplications:

$$\begin{aligned}
m_0 &= 1 \cdot s_8 & m_1 &= \left(\frac{\cos u + \cos 2u}{2} - 1\right) \cdot s_5 \\
m_2 &= \left(\frac{\cos u - \cos 2u}{2}\right) \cdot s_6 & m_3 &= i(\sin u + \sin 2u) \cdot s_2 \\
m_4 &= i \sin 2u \cdot s_7 & m_5 &= i(\sin u - \sin 2u) \cdot s_4
\end{aligned}$$

with $u = -2\pi/5$.

Additions:

$$\begin{aligned}
s_9 &= m_0 + m_1 & s_{10} &= s_9 + m_2 & s_{11} &= s_9 - m_2 & s_{12} &= m_3 - m_4 \\
s_{13} &= m_4 + m_5 & s_{14} &= s_{10} + s_{12} & s_{15} &= s_{10} - s_{12} & s_{16} &= s_{11} + s_{13} \\
s_{17} &= s_{11} - s_{13}
\end{aligned}$$

Output:

$$A_0 = m_0 \qquad A_1 = s_{14} \qquad A_2 = s_{16} \qquad A_3 = s_{17} \qquad A_4 = s_{15}$$

The algorithm contains 17 complex additions¹¹ and 6 complex multiplications.

The  fixed  factor  in  each  multiplication  is  either  purely  real  or  imaginary,  never  a  complex 
number.  This  is  a  property  of  Winograd’s  algorithms.  Each  complex  multiplication  can 
therefore  be  computed  using  just  two  real  multiplications  instead  of  three  or  four. 
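The 5-point algorithm can be checked numerically with a short Python sketch that follows the listed operations one by one and compares the result with a direct evaluation of the DFT sum. The sketch assumes the kernel W = exp(−i2π/5), for which the constant must be taken as u = −2π/5; using u = +2π/5 instead yields the transform with the conjugate kernel. The function names `winograd_dft5` and `direct_dft` are hypothetical.

```python
import cmath
import math


def winograd_dft5(a, u=-2.0 * math.pi / 5.0):
    """5-point DFT computed with 17 additions and 6 multiplications."""
    a0, a1, a2, a3, a4 = a
    # First layer of additions
    s1 = a1 + a4
    s2 = a1 - a4
    s3 = a3 + a2
    s4 = a3 - a2
    s5 = s1 + s3
    s6 = s1 - s3
    s7 = s2 + s4
    s8 = s5 + a0
    # Multiplications: every fixed factor is purely real or purely imaginary
    m0 = 1.0 * s8
    m1 = ((math.cos(u) + math.cos(2 * u)) / 2 - 1) * s5
    m2 = ((math.cos(u) - math.cos(2 * u)) / 2) * s6
    m3 = 1j * (math.sin(u) + math.sin(2 * u)) * s2
    m4 = 1j * math.sin(2 * u) * s7
    m5 = 1j * (math.sin(u) - math.sin(2 * u)) * s4
    # Second layer of additions
    s9 = m0 + m1
    s10 = s9 + m2
    s11 = s9 - m2
    s12 = m3 - m4
    s13 = m4 + m5
    s14 = s10 + s12
    s15 = s10 - s12
    s16 = s11 + s13
    s17 = s11 - s13
    return [m0, s14, s16, s17, s15]


def direct_dft(a, sign=-1):
    """Reference DFT from the defining sum with kernel exp(sign * i*2*pi/N)."""
    n_points = len(a)
    w = cmath.exp(sign * 2j * cmath.pi / n_points)
    return [sum(x * w ** (n * k) for n, x in enumerate(a)) for k in range(n_points)]


if __name__ == "__main__":
    samples = [1 + 1j, 2 - 3j, -0.5, 4j, -2 + 0.25j]
    error = max(abs(x - y) for x, y in zip(winograd_dft5(samples), direct_dft(samples)))
    print(error)
```

Because each fixed factor is purely real or purely imaginary, each of the six complex multiplications in the sketch corresponds to two real multiplications in hardware.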


2.3 WARD, McCANNY AND McWHIRTER'S SYSTOLIC ARCHITECTURE

The architecture presented here has been proposed by Ward, McCanny and
McWhirter [27],[28]. It has been chosen over other architectures [22],[29] because it is
complete, simple, and representative of the group. This architecture falls into the
"systolic" category, and hence it has several attractive properties [23],[24]. However it is rather
inefficient in its use of the silicon area, especially for large transform sizes. This
drawback actually motivated us to develop a different, more compact architecture, which is
presented in the next section. Since the systolic architecture provides general insight into
the implementation of Winograd's algorithms, and might be useful for some applications,
it is described first.

In the systolic architecture, the number of multipliers is kept small, whereas the
number of adders is allowed to grow proportionally to N². This allows computing the

¹¹ The term additions as used here refers to both the addition and the subtraction operations of the
algorithm.


additions using four regular arrays of cells. For implementing an N-point transformation,
the "systolic" arrays are programmed in such a way that some cells are active, i.e. compute
an addition, while others are simply used as delay units. The arrays compute far more
additions than required by Winograd's algorithms, and the resultant layouts are therefore
not as compact as they could be. On the other hand, since the layout regularity is very
high, both the design time and the risk of a design error are reduced. The clock rate of
the arrays is independent of their size, and very high computational speeds can be reached
even with very large arrays [24].

The architecture accepts and processes the N input samples in parallel, and is
called bit-serial because the samples enter and travel in the circuit in a serial fashion, i.e.
bit by bit. The data rate is inversely proportional to the number of bits per sample. High
processing rates are achieved by computing all the N Fourier coefficients in parallel. The
architecture requires 2N input and 2N output pins, for the complex data samples and
Fourier coefficients, respectively.

The architecture is best understood when an N-point transformation is written
in the form:

$$\mathbf{A} = \mathbf{Z}\,(\mathbf{X}\mathbf{a} \times \mathbf{Y}\mathbf{b}), \tag{4}$$

where × denotes the element-by-element product of two vectors. The input samples and
DFT coefficients form the vectors a and A, respectively. X is an
M × N (row × column) matrix and Z is an N × M matrix, where M denotes the number
of complex multiplications of the N-point transformation. The matrices X and Y contain
only +1, −1 and 0 values. It is through the matrix-vector products that the additions
of Winograd's algorithm are carried out. The product Yb can be precalculated so as to
form a set of M twiddle factors having either a purely real or imaginary value.

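The 5-point transformation [2], for example, can be calculated using (4) where

$$\mathbf{X} = \begin{pmatrix}
1 & 1 & 1 & 1 & 1 \\
0 & 1 & 1 & 1 & 1 \\
0 & 1 & -1 & -1 & 1 \\
0 & 1 & 0 & 0 & -1 \\
0 & 1 & -1 & 1 & -1 \\
0 & 0 & -1 & 1 & 0
\end{pmatrix},$$

$$(\mathbf{Y}\mathbf{b})^{t} = \left( 1, \;\; \frac{\cos u + \cos 2u}{2} - 1, \;\; \frac{\cos u - \cos 2u}{2}, \;\; i(\sin u + \sin 2u), \;\; i \sin 2u, \;\; i(\sin u - \sin 2u) \right)$$

with $u = -2\pi/5$, and

$$\mathbf{Z} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & -1 & 0 \\
1 & 1 & -1 & 0 & 1 & 1 \\
1 & 1 & -1 & 0 & -1 & -1 \\
1 & 1 & 1 & -1 & 1 & 0
\end{pmatrix}.$$

¹² These twiddle factors play a role similar to the twiddle factors in the FFT, but they differ from the
latter in number and value.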

The systolic architecture would yield an implementation with 2M = 12
multipliers, and 4MN = 120 array cells, of which 84 would be performing additions. When
considering that Winograd's algorithm uses just 34 additions, the systolic architecture
appears inefficient. However, it is simple and very regular.
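The matrix form (4) and the cell counts quoted above can be checked with the following Python sketch. It is an illustration under stated assumptions: the kernel is taken as W = exp(−i2π/5) with u = −2π/5, the product × is taken element by element, and one adding cell is counted per non-zero entry of X and Z, once for the real and once for the imaginary data path. The last assumption is an interpretation that happens to reproduce the figure of 84; the function names are hypothetical.

```python
import cmath
import math

# Pre-addition matrix X (M x N) and post-addition matrix Z (N x M), 5-point case
X = [
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1],
    [0, 1, -1, -1, 1],
    [0, 1, 0, 0, -1],
    [0, 1, -1, 1, -1],
    [0, 0, -1, 1, 0],
]
Z = [
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, -1, 0],
    [1, 1, -1, 0, 1, 1],
    [1, 1, -1, 0, -1, -1],
    [1, 1, 1, -1, 1, 0],
]


def twiddle_factors(u=-2.0 * math.pi / 5.0):
    """The M = 6 precalculated factors Yb, each purely real or imaginary."""
    cu, c2u, su, s2u = math.cos(u), math.cos(2 * u), math.sin(u), math.sin(2 * u)
    return [1.0, (cu + c2u) / 2 - 1, (cu - c2u) / 2,
            1j * (su + s2u), 1j * s2u, 1j * (su - s2u)]


def matrix_dft5(a):
    """Evaluate A = Z (Xa x Yb) with an element-by-element product."""
    xa = [sum(x * v for x, v in zip(row, a)) for row in X]
    products = [p * t for p, t in zip(xa, twiddle_factors())]
    return [sum(z * p for z, p in zip(row, products)) for row in Z]


def cell_counts():
    """Return (total cells, adding cells) for the four systolic arrays."""
    m, n = len(X), len(X[0])
    nonzero = sum(v != 0 for row in X for v in row) + sum(v != 0 for row in Z for v in row)
    return 4 * m * n, 2 * nonzero


if __name__ == "__main__":
    samples = [1 + 1j, 2 - 3j, -0.5, 4j, -2 + 0.25j]
    w = cmath.exp(-2j * cmath.pi / 5)
    reference = [sum(x * w ** (n * k) for n, x in enumerate(samples)) for k in range(5)]
    print(max(abs(p - q) for p, q in zip(matrix_dft5(samples), reference)))
    print(cell_counts())
```

The gap between the 84 adding cells and the 34 real additions of the algorithm arises because the matrix form recomputes sums that the algorithm shares between outputs.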

The number of gates in one array cell is now examined. Using the cell
functionality described in [27], a logic diagram, such as the one shown in Fig. 2, can be
designed. This particular design would contain 84 gates if it were implemented using a popular
CMOS library [41] with flip-flops featuring clear and scan. Based on this design, the total
number $G_a$ of gates in the four systolic arrays is:

$$G_a = 336\,NM. \tag{5}$$


The  main  drawback  of  this  architecture  is  that  the  systolic  arrays  reejuires  a 
number  of  cells  that  is  proportional  to  N"^.  As  a  result,  the  architecture  quickly  becomes 
prohibitively  expensive  ais  N  increases. 
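The figures quoted for the 5-point case can be recomputed from the matrices of (4). The sketch below is only an illustration: it assumes that a systolic cell performs an addition exactly where the corresponding entry of X or Z is nonzero, an interpretation that happens to reproduce the 84 adding cells mentioned above. The function names are hypothetical.

```python
# 5-point matrices of equation (4): X is M x N, Z is N x M.
X = [
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1],
    [0, 1, -1, -1, 1],
    [0, 1, 0, 0, -1],
    [0, 1, -1, 1, -1],
    [0, 0, -1, 1, 0],
]
Z = [
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, -1, 0],
    [1, 1, -1, 0, 1, 1],
    [1, 1, -1, 0, -1, -1],
    [1, 1, 1, -1, 1, 0],
]


def nonzeros(matrix):
    """Number of +1/-1 entries, i.e. cells assumed to perform an addition."""
    return sum(1 for row in matrix for entry in row if entry != 0)


def systolic_cost(x, z, gates_per_cell=84):
    """Cost figures for the four systolic arrays (real and imaginary X and Z)."""
    m, n = len(x), len(x[0])
    cells = 4 * m * n
    return {
        "multipliers": 2 * m,
        "cells": cells,
        "adding_cells": 2 * (nonzeros(x) + nonzeros(z)),
        "gates": gates_per_cell * cells,  # equals 336 * N * M, as in (5)
    }


cost = systolic_cost(X, Z)
print(cost)
```

For N = 5 and M = 6 this gives 12 multipliers, 120 cells and 10080 gates, so most of the gates sit in cells whose matrix entry is zero.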


3.0  A  ROUTED  ARCHITECTURE  FOR  THE  WFTA 

In this section, a cost-effective bit-serial architecture for the WFTA is presented. Even
though this architecture lacks some of the elegant properties of the systolic architectures [24],
it should yield circuits having smaller areas, and hence allow discrete Fourier
transformations of higher lengths.

In the proposed architecture, an N-point WFTA is mapped directly onto silicon,
with a minimum of modifications. This follows the idea of MacLeod and Bragg [25]. The
resultant layout exhibits little regularity, as wires of various lengths connect the adders.
Because the architecture requires routing between the adders, we refer to it as the routed
architecture.

Figure 2: Logic diagram of one systolic array cell. The logic symbols are described in
Appendix C.

The adders can be organized in layers. The layers can then be stacked and
interconnected using "channel routing" software. Assume that the Winograd nesting
method [2] is used for constructing an N-point algorithm from two smaller N1- and N2-point
algorithms. Let N = N1 N2, with N1 and N2 relatively prime. The adders in the
N-point implementation are now examined. Let L1 and L2 denote the number of layers
of adders, and A1 and A2 denote the largest number of adders per layer, in the N1- and
N2-point implementations, respectively. Let M1 denote the number of multiplications in
the N1-point implementation. The number of layers (L) and the largest number of adders
per layer (A) in the N-point implementation are given by:

L = L1 + L2,  (6)

A = max[N2 A1, M1 A2].  (7)

The number of layers grows proportionally to log(N), whereas the number of
adders per layer grows proportionally to N. The total number of adders is therefore
proportional to N log(N). Note that this does not take into account the routing between
and through the layers, and the silicon area itself asymptotically grows at a higher rate*.
Nevertheless, for the values of N considered here, the adders occupy more area than the
interconnections, and their number is relevant.
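Equations (6) and (7) are simple enough to evaluate by hand, but a small helper makes it easy to try different factorizations. The sketch below is a direct transcription of the two formulas. The numerical inputs are illustrative values chosen for this example only and are not taken from the report.

```python
def nested_layout(n2, l1, a1, m1, l2, a2):
    """Return (L, A) for an N = N1*N2 point nested algorithm.

    L follows equation (6): L = L1 + L2.
    A follows equation (7): A = max(N2*A1, M1*A2).
    """
    layers = l1 + l2
    widest_layer = max(n2 * a1, m1 * a2)
    return layers, widest_layer


# Illustrative inputs (assumptions): a 3-point module with 4 layers, at most
# 2 adders per layer and 3 multiplications, nested with a 20-point module
# having 9 layers and at most 20 adders per layer.
layers, widest = nested_layout(n2=20, l1=4, a1=2, m1=3, l2=9, a2=20)
print(layers, widest)
```

The product of the two returned values is only an upper bound on the number of adders, since few layers are as wide as the widest one.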

In the routed architecture, the adders are divided into four groups. Two groups
compute the real and imaginary additions before the multiplications; the two others
compute the additions that follow the multiplications. Figure 3 shows a floorplan for the
routed architecture, where the real parts of the input samples are processed on one side,
and the imaginary parts on the other. The two sides are identical, except in the multipliers
where some twiddle factors differ in sign. The computations of the two sides can
thus be carried out using separate circuits, provided that approximately M/2 pins are
available for data exchanges between the circuits and that the twiddle factors are
programmable. In other words, two identical WFT circuits designed to compute DFTs of
real-valued input sequences could be connected together to compute DFTs
of complex-valued sequences. This advantageous partitioning could not be implemented
easily with the FFT.

Figure 3: Floorplan of the routed architecture. (Block labels: real part of input samples;
imaginary part of input samples; real part of DFT coefficients; imaginary part of DFT
coefficients.)

* Thompson [42] has shown with an asymptotic analysis that the total area of any circuit computing
the DFT in fixed time must grow proportionally to N^2.

In a bit-serial architecture, each addition can be implemented with a single cell
of modest complexity. Each multiplication, on the other hand, requires a row of l_m cells,
where l_m denotes the number of bits in the multiplicand (the twiddle factor). Therefore,
for moderate transform sizes (N < 100), the multipliers generally occupy more silicon area
than the four groups of adders. For higher values of N, the routing between the layers
of adders occupies a higher percentage of the silicon area and may cause difficulties. It
should be pointed out that the systolic architectures would also become impractical at
that point.

The attainable throughput is equal to the clock rate divided by the number of
bits per sample (l_s). In order to get a high clock rate (30 MHz or higher), the "critical
path" of the circuit, i.e. the electrical path with the longest propagation delay, must be
minimized. In the routed architecture, the critical path may either be in the adders or
in the multipliers. Indeed, the wires between the layers of adders may be of significant
length and exhibit a large capacitance, having a significant effect on the circuit's speed.
The adders should therefore be pipelined so that the data transfers between the different
layers occur simultaneously. Inside each bit-serial multiplier, there is a carry chain where
pipelining should also be applied. Pipelining shortens the critical path, but increases the
latency of the circuit.
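The throughput relation can be made concrete with a one-line calculation. In the sketch below the clock rate of 30 MHz comes from the text, whereas the 16 clock cycles per sample (sample bits plus padding) is an assumed value used only for illustration.

```python
def transform_period_us(clock_hz, cycles_per_sample):
    """Time between two consecutive transforms, in microseconds.

    In a bit-serial circuit all samples of a transform enter in parallel,
    one bit per clock cycle, so one transform is accepted every
    `cycles_per_sample` cycles.
    """
    return cycles_per_sample / clock_hz * 1e6


def transforms_per_second(clock_hz, cycles_per_sample):
    """Throughput: clock rate divided by the number of cycles per sample."""
    return clock_hz / cycles_per_sample


period = transform_period_us(30e6, 16)      # assumed 16 cycles per sample
rate = transforms_per_second(30e6, 16)
print(round(period, 3), rate)
```

Pipelining does not appear in this calculation: it changes the latency of a transform, not the rate at which transforms are accepted.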

Figure 4 shows a block diagram of the routed architecture for Winograd's 5-point
algorithm. Samples enter bit-serially at the top. They traverse the first two groups
of adders (left and right), are multiplied by the twiddle factors, and traverse two more
groups of adders. The bit-serial DFT coefficients then exit the circuit. Flip-flops have
been inserted after every layer of adders, and at every three stages of multiplier cells, for
pipelining.

Figure 4: The routed architecture for Winograd's 5-point algorithm. In this particular
case, the leftmost and rightmost multipliers could be replaced by flip-flops,
since their twiddle factors are equal to one.

An important issue when implementing the discrete Fourier transformation is
the partitioning: can a computational load too large for a single device be distributed
among several devices at a reasonable cost? A partitioning technique based on Winograd's
nesting method is now proposed for the routed architecture.

Using Winograd's nesting method, an N = N1 N2-point algorithm can be constructed
from two smaller N1- and N2-point algorithms, provided that N1 and N2 are
relatively prime. Let A1 and A2 denote the number of additions, and M1 and M2 denote
the number of multiplications, in the N1- and N2-point algorithms, respectively. The
constructed N-point algorithm then contains:

M = M1 M2 multiplications  (8)

and

A = min[N2 A1 + M1 A2, M2 A1 + N1 A2] additions.  (9)

By examining the nesting method [2], the structure of the N-point algorithm appears
as a "core" of M1 N2-point transformations "surrounded" by A1 N2 additions. The core
is likely to be the expensive part because it contains all the multiplications. A natural
way of partitioning it is between the N2-point transformations. Thus a circuit computing
one N2-point transform and all the A1 N2 surrounding additions could compute N2 of
the N Fourier coefficients. Placing M1 such circuits side by side would yield an N-point
transformation machine. The appeal of this approach is twofold. First, the number of
multipliers is minimal. Second, the circuits are identical*. The disadvantage is that
the A1 N2 surrounding additions are duplicated M1 times (once per circuit). Section 4.3
gives an example of the partitioning technique for N1 = 3 and N2 = 20. The example
shows that modifications can be applied to the architecture for reducing its cost.
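Equations (8) and (9) can be applied repeatedly to build larger transforms. The sketch below nests a 4-point and a 5-point module into a 20-point algorithm, then nests the result with a 3-point module. The operation counts of the short modules (3-point: 3 multiplications and 6 additions; 4-point: 4 and 8; 5-point: 6 and 17, trivial multiplications included) are commonly quoted values for Winograd's short algorithms and are assumptions here; only the 5-point figures can be cross-checked against the counts given earlier in this report.

```python
import math


def nest(first, second):
    """Combine two modules (N, M, A) into the nested N1*N2 module.

    M follows equation (8) and A follows equation (9); all counts are
    complex operations.
    """
    n1, m1, a1 = first
    n2, m2, a2 = second
    if math.gcd(n1, n2) != 1:
        raise ValueError("N1 and N2 must be relatively prime")
    mults = m1 * m2
    adds = min(n2 * a1 + m1 * a2, m2 * a1 + n1 * a2)
    return n1 * n2, mults, adds


# Assumed operation counts (N, M, A) of the short modules.
P3 = (3, 3, 6)
P4 = (4, 4, 8)
P5 = (5, 6, 17)

p20 = nest(P4, P5)
p60 = nest(P3, p20)
print(p20, p60)
```

Under these assumptions the 20-point algorithm has 24 multiplications and 108 complex additions, and the 60-point algorithm has 72 multiplications and 444 complex additions.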

4.0  LOGIC DESIGN OF A 20-POINT WINOGRAD FOURIER TRANSFORMATION CIRCUIT

This section presents the design of a 20-point Winograd Fourier transformation circuit.
The architecture chosen for implementing the circuit is the routed architecture presented
in Section 3.0. We focus our attention on the design at the logic level, and give schematic
diagrams of the cells required for building the circuit. The design has been given to a
manufacturer for fabrication. Samples shall be available by the first quarter of 1992.

The Winograd Fourier transformation circuit, or WFT circuit, is designed to
bit-serially accept and produce data in two's complement form. Surprisingly, bit-serial
arithmetic components capable of accepting data in two's complement notation are hard
to find in the literature. Moreover, the few that we found generally turned out to be
expensive in the number of gates. Most of the basic cells presented in this section are
therefore either of our own design, or the result of several modifications and iterations of a
published design.

This section has six parts. First, the data format convention for communicating
with the WFT circuit is described. The cells required for the additions are presented.
Then the functionality required for a 60-point multi-circuit mode is included in the circuit.
The multipliers are described in great detail. Next the position of the binary point at the
output of the circuit is examined. The different modes of the circuit are explained with
the associated control signals. Lastly, gate counts are given for all the cells and for the
complete circuit.

* This is assuming that the twiddle factors can be programmed to suit the N-point transformation.

4.1  DATA  FORMAT 

The data format has an impact on both the accuracy of the output and on the cost of
the design. A fixed point format is cheaper to implement than a floating point one. However,
if the adders and multipliers use fixed point arithmetic, then overflows might occur in
the circuit when the valid range determined by the number of bits l_s is exceeded. The
following four strategies can be used to deal with this problem:

1. Headroom provision: The number of bits l_s is increased while keeping the sample
values constant, so only a fraction of the input range available is used.

2. Fixed scaling: The outputs of some adders and multipliers are scaled down. The
least significant bit (LSB) is dropped at given stages of the computations to increase
the dynamic range available.

3. Automatic scaling: Scaling down is automatically applied where necessary. Thus
overflows never occur. This is sometimes called "block floating point".

4. Floating point: Each individual data element is represented in a floating point
format.

The first strategy (headroom provision) is the simplest, but it slows the bit-serial
circuit down since l_s is increased. The second strategy (fixed scaling) is easily
implemented and can reduce the probability of overflow. However, if the data is scaled
down more than necessary, then precision will be lost at the output. Thus the second
strategy is best combined with the first strategy: adding headroom allows delaying the
scaling to later stages, where the LSBs are non-significant anyway. The third strategy
(automatic scaling) is intuitively appealing because just the minimum amount of scaling
is applied. The last strategy (floating point) is the one that provides the most information
at the output. It is more expensive than the three others.
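The trade-off between the first two strategies can be illustrated at the word level. The following sketch is a behavioural model only, not a description of the circuit: it adds two integers in a two's complement word of a given width, optionally dropping the LSB of the sum, and reports whether the result left the valid range. The function names and the 8-bit word length are assumptions made for the example.

```python
def wrap(value, bits):
    """Reduce an integer to its `bits`-wide two's complement interpretation."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        value -= 1 << bits
    return value


def add_fixed(x, y, bits, scale=False):
    """Add two fixed point words; return (result, overflow flag).

    With scale=True the LSB of the sum is dropped (fixed scaling), which
    halves the result and keeps it inside the valid range.
    """
    total = x + y
    if scale:
        total >>= 1  # arithmetic shift: discard the least significant bit
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    overflow = not (low <= total <= high)
    return wrap(total, bits), overflow


plain = add_fixed(100, 60, bits=8)               # exceeds the 8-bit range
headroom = add_fixed(100, 60, bits=9)            # one extra bit of headroom
scaled = add_fixed(100, 60, bits=8, scale=True)  # fixed scaling
lossy = add_fixed(3, 2, bits=8, scale=True)      # scaling costs precision
print(plain, headroom, scaled, lossy)
```

The last call shows the cost of scaling more than necessary: the sum 5 becomes 2, so half an LSB of information is lost even though no overflow was possible.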

The WFT circuit uses fixed scaling, which is fairly simple to implement. The
block floating point and floating point strategies should be considered for future
implementations.

Each complex sample has a real and an imaginary part, whose values are represented
in a fixed point, two's complement form. The 20 complex samples all enter the
WFT circuit synchronously, in parallel, with their real and imaginary parts on separate
input pins. The number of input pins is therefore equal to 2N = 40. The least significant
bits enter the circuit first, and the most significant bits last. Samples may be of
any length but must be separated by one bit, or more, of padding. The values of the
padding bits are discarded by the circuit. Along with the input samples, a control signal
called "data valid in" (DVI) indicates whether the 2N accompanying data bits belong to
samples (DVI = 1) or are simply padding bits (DVI = 0).
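The input convention can be sketched for a single pin as follows. The code serializes a sequence of two's complement samples LSB first, separates them with padding bits, and produces the accompanying DVI stream; a matching decoder recovers the samples. The function names, the 4-bit word length and the choice of 0 as padding value are assumptions for the example. The real circuit has 2N such pins operating in parallel.

```python
def to_bits(value, bits):
    """Two's complement bits of `value`, least significant bit first."""
    return [(value >> i) & 1 for i in range(bits)]


def from_bits(bit_list):
    """Inverse of to_bits: the last bit of the list is the sign bit."""
    value = sum(b << i for i, b in enumerate(bit_list))
    if bit_list[-1]:
        value -= 1 << len(bit_list)
    return value


def serialize(samples, bits, padding=1):
    """Return the (data, dvi) bit streams seen on one input pin."""
    data, dvi = [], []
    for sample in samples:
        data += to_bits(sample, bits)
        dvi += [1] * bits        # DVI = 1: the bit belongs to a sample
        data += [0] * padding    # padding value is arbitrary
        dvi += [0] * padding     # DVI = 0: padding bit
    return data, dvi


def deserialize(data, dvi):
    """Recover the samples, ignoring every bit for which DVI = 0."""
    samples, current = [], []
    for bit, valid in zip(data, dvi):
        if valid:
            current.append(bit)
        elif current:
            samples.append(from_bits(current))
            current = []
    if current:
        samples.append(from_bits(current))
    return samples


data, dvi = serialize([5, -3], bits=4)
print(data, dvi, deserialize(data, dvi))
```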

The 20 complex Fourier coefficients exit the WFT circuit in parallel, on a second
set of 2N = 40 pins, after a certain delay due to the pipelining in the circuit. The output
format is similar to the input format, and a "data valid out" (DVO) signal generated by the
circuit accompanies the Fourier coefficients. The Fourier coefficients are l_s bits long, with
a binary point whose position depends on the input samples, the twiddle factors, and the
amount of "scaling" that is being applied. They are separated by the same number of bits
of padding as the corresponding input samples. However, the values of the padding bits
may have changed and should be ignored. An "overflow" signal (OFO) is produced on the
clock cycle following the delivery of the most significant bits of the Fourier coefficients.
If this signal is high (OFO = 1), then an overflow has occurred somewhere in the circuit
and the corresponding set of Fourier coefficients is invalid. The WFT circuit pursues its
computations regardless of overflows. An overflow in one transformation does not affect
the following transformations.

4.2  CELLS  FOR  ADDITION  OPERATIONS 

The 20-point WFTA* contains 108 complex (216 real) additions. These additions are
divided into two sets: the additions before the multipliers (124 real additions), and the
additions after the multipliers (92 real additions). Since the additions required for the real
and imaginary data are the same, only one set of additions needs be considered. From
this point, only the additions on real data are examined. It should be kept in mind that
all the circuits shown in this section are duplicated in the WFT circuit.

The adders required for the "pre-multipliers" additions can be distributed on five
layers. Figure 5 shows the 20 input samples a0, a1, ..., a19 and the adders s1, s2, ..., s62,
organized in five horizontal layers. The layers are numbered from one to five, from top
to bottom. The first layer contains the adders s1 through s20 whose inputs are connected
only to the input samples. The second layer contains the adders whose inputs connect to
the first layer, the third layer contains adders whose inputs connect to the second layer,
and so on. Each line in Fig. 5 may represent more than one data connection.

* See Appendix A for the 20-point WFTA.

Sometimes it is imperative to pass data across one or more layers. For instance,
on the second layer, two "through" wires allow the adders s54 and s52 on the fourth layer
to get some input from the first layer. The data on a through wire is always transferred
down, as adders on one layer take their input only from the preceding layers. Counting
the number of adders per layer, from top to bottom, yields 20, 18, 14, 8, and 2 adders,
respectively. The 24 outputs of the adders enter into the multipliers.

The "post-multipliers" additions can be distributed on four layers. Figure 6
shows the adders and their data dependencies. From top to bottom, the layers have 14,
8, 16, and 8 adders, respectively. The 20 outputs of the post-multipliers additions are the
real parts of the Fourier coefficients A0, A1, ..., A19.

Figure 6: Post-multipliers additions in a 20-point WFTA.
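The per-layer adder counts can be checked against the totals given at the beginning of this section. The short sketch below assumes the post-multiplier layers hold 14, 8, 16 and 8 adders, which is the reading consistent with the stated 92 real additions; the variable names are illustrative.

```python
# Adders per layer on the real-data side, from top to bottom.
PRE_LAYERS = [20, 18, 14, 8, 2]
POST_LAYERS = [14, 8, 16, 8]   # assumed reading, see the note above

# Every adder is duplicated for the imaginary data.
pre_real_additions = 2 * sum(PRE_LAYERS)
post_real_additions = 2 * sum(POST_LAYERS)
total_real_additions = pre_real_additions + post_real_additions
total_complex_additions = total_real_additions // 2

print(pre_real_additions, post_real_additions, total_complex_additions)
```

The totals come out as 124 and 92 real additions, i.e. 216 real or 108 complex additions.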

As discussed in Section 3.0, pipelining the output of the adders is recommended
for maintaining a high clock rate. The through wires must then also be pipelined to ensure
that the partial sums are synchronized.

In order to reduce the risk of overflow, programmable scalers are used for truncating
the partial sums. If the scalers on a layer are "enabled", then the least significant
bit of every sum produced by that layer is discarded, making room for the most significant
bit, and shifting the binary point by one position. Enabling or disabling the scalers thus
allows tailoring the precision and dynamic range of the WFT circuit to the statistics of
the input samples. Scalers must also be inserted along through wires.

Now  that  the  global  organization  of  the  additions  has  been  examined,  the  logic 
diagrams  of  the  five  associated  cells  are  presented.  There  is  a  padding  cell,  an  adder, 
an  overflow  detection  cell,  a  subtracter,  and  a  hold-up  cell.  Since  the  designs  are  mostly 
self-explanatory,  the  explanations  are  brief. 

4.2.1  Padding  Cell 

The multipliers, in order to work properly, require that the most significant bit of the
multiplicand be followed by at least one padding bit having the same value. This constraint
can be met by making the samples go through padding cells upon their entry into
the circuit. A padding cell is shown in Fig. 7. A sample enters the cell by the input X
and exits by the output X'.

Figure 7: Logic diagram of a padding cell.

In the WFT circuit, the external DVI signal is delayed by one clock cycle with
respect to the data and becomes a "reset" signal (R) that is used directly by the arithmetic
cells. Before exiting the circuit, the data is delayed by one clock cycle with respect to the
signal R, and the latter is output as DVO. The circuit shown under the padding cell in
Fig. 7 transforms DVI into R. It is shared by all the padding cells of the circuit.

4.2.2  Adder 

A bit-serial adder suitable for the WFT circuit is shown in Fig. 8. The two terms entering
on X and Y are added together. The resultant sum exits on S. Scaling is implemented
through a single multiplexer. This multiplexer is controlled by a circuit shared by all the
adders on the same layer. The shared circuit also resets the carry between the additions.
Setting the signal SC high (SC = 1) scales down all the outputs of the layer. A "partial
overflow" signal (POV) is produced at every clock cycle by each adder.

Figure 8: Logic diagram of a two's complement adder with scaling.
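The behaviour described above can be mimicked in software. The following sketch is a behavioural model under stated assumptions, not the gate-level cell of Fig. 8: it adds two LSB-first words with a single carry bit, uses the sign-extended padding slot to obtain one extra sum bit, and either keeps the low bits (reporting overflow when the extra bit disagrees with the sign) or drops the LSB when scaling is enabled. All names are hypothetical.

```python
def to_bits(value, bits):
    """Two's complement bits of `value`, least significant bit first."""
    return [(value >> i) & 1 for i in range(bits)]


def from_bits(bit_list):
    """Signed value of an LSB-first two's complement word."""
    value = sum(b << i for i, b in enumerate(bit_list))
    if bit_list[-1]:
        value -= 1 << len(bit_list)
    return value


def serial_add(x_bits, y_bits, scale=False):
    """Bit-serial addition of two equal-length words; returns (sum, overflow)."""
    n = len(x_bits)
    xs = x_bits + [x_bits[-1]]   # sign extension into the padding slot
    ys = y_bits + [y_bits[-1]]
    carry = 0                    # carry is reset between additions
    total = []
    for a, b in zip(xs, ys):     # one full-adder step per clock cycle
        total.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    if scale:
        return total[1:], False  # drop the LSB: the halved sum always fits
    overflow = total[n] != total[n - 1]
    return total[:n], overflow


s1, ovf1 = serial_add(to_bits(5, 4), to_bits(6, 4))               # 11 > 7
s2, ovf2 = serial_add(to_bits(5, 4), to_bits(6, 4), scale=True)   # 11 >> 1
s3, ovf3 = serial_add(to_bits(-3, 4), to_bits(2, 4))
print(ovf1, from_bits(s2), from_bits(s3))
```

In this model an overflow is only known once the most significant bit has been produced, which is consistent with an overflow indication that trails the data.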

4.2.3  Overflow  Detection  Cell 

The signals POV produced by the adders of a layer are combined together in an overflow
detection cell, as shown in Fig. 9. This cell declares whether an overflow has occurred
(OVF = 1), or not (OVF = 0), in the layer.

Overflows may occur in several layers of the WFT circuit. The overflow signal
of each layer must therefore be combined with the overflow signals of the other layers to
yield the overall OFO signal. This is done using the scan circuitry shown in Fig. 10. The
OFO signal is output one clock cycle after the MSB of the Fourier coefficients. In a large
system, the WFT circuit may be preceded by other devices that may also overflow. An
input to the overflow scan (OFI) has therefore been included in the design. Whenever a set
of samples marked with an overflow enters the WFT circuit, the corresponding Fourier
coefficients are thus automatically declared invalid.

Figure 10: Logic diagram of the overall overflow circuitry.

4.2.4  Subtracter 

A subtraction can be implemented either by fitting an adder with a sign inverter costing
18 gates, or by using a true subtracter. The second approach requires more design work,
but yields a cell that has fewer gates. A subtracter cell is shown in Fig. 11. Remarkably,
it has exactly the same number of gates as the adder of Fig. 8.

4.2.5  Hold-up Cell

Pipelining must be applied evenly across the whole width of a layer; otherwise the partial sums
may lose their synchronism. This is also true of scaling; otherwise the position of the
binary point may vary among the Fourier coefficients. Pipelining and scaling must therefore
be applied to the data crossing one or more layers on through wires. A hold-up cell,
such as the one shown in Fig. 12, serves that purpose.

Figure 11: Logic diagram of a two's complement subtracter with scaling.

Figure 12: Logic diagram of a hold-up cell with scaling.


4.3  CELLS  FOR  A  60-POINT  NESTED  TRANSFORMATION 


In Appendix A, a 60-point algorithm is derived from a 3-point algorithm and a 20-point
algorithm. The resultant algorithm consists of a set of additions, three 20-point
transformations, and another set of additions. The 20-point transformations are very similar
to discrete Fourier transformations, with the exception of the twiddle factors, which have
different values. Assuming that the twiddle factors in the 20-point WFT circuit can be
programmed to the required values, three circuits could therefore compute the "core" of
the 60-point algorithm. Pursuing this idea, the extra adders required by the 60-point
algorithm could be included in the WFT circuit. This would slightly increase the cost of
the design, but greatly improve its versatility and usefulness.

There are many ways of dividing the computational load of the 60-point
transformation among a set of identical devices. A three circuit configuration is probably the
most efficient in terms of silicon area. However, the data flow between the circuits would
require 240 data pins per circuit, and yield very high packaging costs. For many
applications, the configuration shown in Fig. 13 with five circuits instead of three may provide a
more balanced solution. This configuration requires only 160 data pins per circuit. Three
of the circuits are used for additions and 20-point transforms: they accept the 60 complex
samples arranged in three vectors a0, a1, and a2, and produce three intermediate vectors
of results M0, M1, and M2. The 20-point transformations are denoted by W0, W1, and
W2. Two circuits compute additions only: they accept M0, M1, and M2 and produce
the Fourier coefficients in vectors A1 and A2. One "FIFO" circuit simply delays M0 to
produce A0. It is not rigorously required, but has been included for convenience.

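The Winograd nesting scheme, which has been used for building the 60-point
WFTA, gets its indexing from the "Chinese Remainder Theorem." The order in which
the input samples must be presented to the five circuits is therefore rather peculiar. The
reader is referred to Appendix A for explanations on how the following vectors a0, a1, a2,
A0, A1, and A2 are obtained:

\[
\begin{aligned}
\mathbf{a}_0 &= (a_{0}, a_{21}, a_{42}, a_{3}, a_{24}, a_{45}, a_{6}, a_{27}, a_{48}, a_{9}, a_{30}, a_{51}, a_{12}, a_{33}, a_{54}, a_{15}, a_{36}, a_{57}, a_{18}, a_{39}), \\
\mathbf{a}_1 &= (a_{40}, a_{1}, a_{22}, a_{43}, a_{4}, a_{25}, a_{46}, a_{7}, a_{28}, a_{49}, a_{10}, a_{31}, a_{52}, a_{13}, a_{34}, a_{55}, a_{16}, a_{37}, a_{58}, a_{19}), \\
\mathbf{a}_2 &= (a_{20}, a_{41}, a_{2}, a_{23}, a_{44}, a_{5}, a_{26}, a_{47}, a_{8}, a_{29}, a_{50}, a_{11}, a_{32}, a_{53}, a_{14}, a_{35}, a_{56}, a_{17}, a_{38}, a_{59}), \\
\mathbf{A}_0 &= (A_{0}, A_{21}, A_{42}, A_{3}, A_{24}, A_{45}, A_{6}, A_{27}, A_{48}, A_{9}, A_{30}, A_{51}, A_{12}, A_{33}, A_{54}, A_{15}, A_{36}, A_{57}, A_{18}, A_{39}), \\
\mathbf{A}_1 &= (A_{40}, A_{1}, A_{22}, A_{43}, A_{4}, A_{25}, A_{46}, A_{7}, A_{28}, A_{49}, A_{10}, A_{31}, A_{52}, A_{13}, A_{34}, A_{55}, A_{16}, A_{37}, A_{58}, A_{19}), \\
\mathbf{A}_2 &= (A_{20}, A_{41}, A_{2}, A_{23}, A_{44}, A_{5}, A_{26}, A_{47}, A_{8}, A_{29}, A_{50}, A_{11}, A_{32}, A_{53}, A_{14}, A_{35}, A_{56}, A_{17}, A_{38}, A_{59}).
\end{aligned}
\]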
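The ordering listed above follows a regular pattern: element k of vector j carries the index (40j + 21k) mod 60, where 40 and 21 are the two numbers below 60 that are congruent to (1, 0) and (0, 1) modulo (3, 20). The sketch below generates the index lists from that observation. It is an illustration of the pattern only, not the derivation given in Appendix A, and the function name is hypothetical.

```python
def crt_index_vectors(n1=3, n2=20):
    """Index lists for the n1 vectors of n2 elements used by the nesting.

    e1 is the multiple of n2 that is congruent to 1 modulo n1, and e2 is
    the multiple of n1 that is congruent to 1 modulo n2 (n1 and n2 must
    be relatively prime for both to exist).
    """
    n = n1 * n2
    e1 = next(k for k in range(0, n, n2) if k % n1 == 1 % n1)
    e2 = next(k for k in range(0, n, n1) if k % n2 == 1 % n2)
    return [[(e1 * j + e2 * k) % n for k in range(n2)] for j in range(n1)]


vectors = crt_index_vectors()
for j, indices in enumerate(vectors):
    print(j, indices)
```

Each index between 0 and 59 appears exactly once, and the same mapping describes both the input vectors and the output vectors.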


Figure 13: Block diagram of a 60-point DFT using five circuits. Note that the Fourier
coefficients produced by the bottom left circuit run through a FIFO circuit
just for synchronization with the other coefficients.

An economical way of implementing the extra additions that are required by
the N-point WFTA consists of using programmable cells capable of either adding or
subtracting. These cells can then be configured in agreement with the role of the circuit
they belong to.

A cell that can be programmed for either adding or subtracting is shown in
Fig. 14. If the control signal PR is set high (PR = 1), then the cell acts as a subtracter
and produces X − Y; otherwise it produces X + Y. The usual timing and scaling control
circuit is shown under the cell.

Figure 14: Logic diagram of a programmable cell capable of either adding or subtracting.
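One common way to obtain such a programmable cell is to invert the bits of Y and preset the carry to one when a subtraction is requested, since −Y equals the bitwise complement of Y plus one in two's complement. The sketch below models that technique bit-serially. It is an assumption made for illustration and may differ from the gate-level design of Fig. 14; overflow detection and scaling are left out.

```python
def to_bits(value, bits):
    """Two's complement bits of `value`, least significant bit first."""
    return [(value >> i) & 1 for i in range(bits)]


def from_bits(bit_list):
    """Signed value of an LSB-first two's complement word."""
    value = sum(b << i for i, b in enumerate(bit_list))
    if bit_list[-1]:
        value -= 1 << len(bit_list)
    return value


def serial_add_sub(x_bits, y_bits, pr):
    """Return X + Y when pr = 0 and X - Y when pr = 1 (LSB-first words)."""
    carry = pr                     # preset carry supplies the "+1" of -Y
    out = []
    for a, b in zip(x_bits, y_bits):
        b ^= pr                    # invert Y when subtracting
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out


total = from_bits(serial_add_sub(to_bits(5, 6), to_bits(3, 6), pr=0))
difference = from_bits(serial_add_sub(to_bits(5, 6), to_bits(3, 6), pr=1))
negative = from_bits(serial_add_sub(to_bits(-7, 6), to_bits(4, 6), pr=1))
print(total, difference, negative)
```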

4.4  CELLS  FOR  MULTIPLICATION  OPERATIONS 

The representation of numbers in two's complement form, as is convenient for addition and
subtraction, complicates the multiplication. Fortunately, the problem of implementing
a bit-serial multiplier accepting two's complement data has been addressed by several
authors. One solution consists of changing the numbers to a sign and magnitude notation,
multiplying their magnitudes with a standard bit-serial multiplier, and transforming the
result back into a correct two's complement number. The numbers can also be recoded
with ternary digits to suit Booth's algorithm. However, a more elegant approach has been
proposed by Lyon [43], who has succeeded in modifying the original pipeline multiplier
developed by Jackson, Kaiser, and McDonald [44], which accepts positive data words only,
allowing it to do correct two's complement multiplication. This last scheme is attractive
for a number of reasons: it is modular, i.e. for an l_m-bit twiddle factor the multiplier
consists of l_m identical cells; it rounds the products to the same length as the input data;
it computes the product at the same rate as the data is entered; and, lastly, it doesn't
require data converters.

Lyon's fully two's complement multiplier can be modified to better suit the WFT
circuit. First, it can be "re-timed" to reduce the amount of pipelining and the number of
gates. Re-timing is a technique for shortening or lengthening the critical path, and thus
the clock cycle duration, of VLSI circuits [45]. Then, the last two stages of the multiplier
can be simplified.

The two cells of the modified multiplier are shown in Fig. 15. In order to get a
multiplier of length l_m, the first cell must be replicated l_m − 1 times, and this row must
be terminated by the second cell. The l_s-bit data X travels through the l_m stages of the
multiplier from left to right, one cell per clock cycle. The l_m-bit twiddle factor Y enters
the multiplier simultaneously with the data, and each of its bits propagates to all the cells
at once. The output Z of the last stage delivers the result, i.e. the most significant bits
of the product of X and Y. The data can be either shorter, equal in length, or longer than
the twiddle factors. If it is shorter, then the twiddle factors must be stored before the
multiplications and not changed.

The "partial product sum" input (PPS) of the first stage allows using an initial
offset for rounding [43]. This "initial offset" (IO) is generated by the circuit that is shown
in Fig. 16. Note that all the multipliers of the WFT circuit can share a single offset
generation circuit.
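The effect of rounding with an initial offset can be illustrated with a word-level model. The sketch below is not a simulation of the cells of Fig. 15: it assumes that the twiddle factor is an l_m-bit two's complement fraction with l_m − 1 fractional bits, and that the offset amounts to adding half an output LSB before the low-order product bits are discarded. Both points are assumptions, as are the function and parameter names.

```python
def multiply_rounded(x, y, lm):
    """Product of integer data x and twiddle factor y, rounded to x's scale.

    y is an integer holding an lm-bit two's complement fraction, i.e. the
    twiddle factor value is y / 2**(lm - 1). The lm - 1 least significant
    product bits are dropped after adding an offset of half an output LSB.
    """
    frac_bits = lm - 1
    offset = 1 << (frac_bits - 1)      # assumed form of the initial offset
    return (x * y + offset) >> frac_bits


half = 64                               # 0.5 with lm = 8 (64 / 128)
r1 = multiply_rounded(101, half, lm=8)      # 50.5 rounds to 51
r2 = multiply_rounded(-101, half, lm=8)     # -50.5 rounds to -50
r3 = multiply_rounded(100, -90, lm=8)       # 100 * (-0.703125) = -70.3125
r4 = (101 * half) >> 7                      # truncation without the offset
print(r1, r2, r3, r4)
```

Without the offset the result would always be rounded towards minus infinity, as the last line shows, which introduces a systematic bias over a long chain of multiplications.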

Each multiplier cell has two multiplexers for selecting one twiddle factor from
a group of five possibilities. The multiplexers are controlled by the signals C0, C1, C2,
and R. If C0 = 0, C1 = 0, and C2 = 0, then the multiplier reads its twiddle factor from Y.
The four other possibilities (TF0, TF1, TF2, and TF3) correspond to the fixed values that
are necessary for computing 20- and 60-point DFTs. Table 2 gives more detail on the
multiplexers' control.
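The selection logic can be written as a small decision function. The sketch below follows our reading of Table 2, where "X" entries are treated as don't-care inputs; it models which source is selected, not how the two multiplexers are wired. The function name and the string labels are illustrative.

```python
def select_twiddle(c0, c1, c2, r):
    """Return the name of the twiddle factor source selected by the controls."""
    if c1 or c2:
        # C0 and R are don't-care for the 60-point fixed values.
        return {(0, 1): "TF1", (1, 0): "TF2", (1, 1): "TF3"}[(c1, c2)]
    if c0:
        return "TF0"                 # fixed value for the 20-point DFT
    return "Q" if r else "Y"         # stored or external twiddle factor


for controls in [(0, 0, 0, 0), (0, 0, 0, 1), (1, 0, 0, 0),
                 (0, 0, 1, 0), (0, 1, 0, 1), (1, 1, 1, 0)]:
    print(controls, select_twiddle(*controls))
```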

Appendix B provides all the twiddle factors that are required by the WFT circuit.
Note that the multiplier shown in Fig. 15 inverts the bits of the twiddle factors TF1, TF2,
and TF3. Hence the values given in the appendix must be inverted before being stored in the
multiplier. The bits of TF0 are not inverted and can be stored directly. The least significant
bit must be stored into the first multiplier stage, where the multiplicands enter, and the
most significant bit into the last stage, where the product exits.

The four flip-flops marked “optional delays” in Fig. 15 allow pipelining the
multiplier to shorten the electrical path that otherwise runs from the PPS input of stage one
to the Z output of stage $l_m$. Putting the flip-flops at every stage may be unnecessary,
since the circuit's critical path would then surely move somewhere else, possibly in the
routing between the adders, or in the output driving circuits. At first glance, inserting
the flip-flops at every three stages should provide sufficient speed, and keep the cost of
the multipliers low.

Figure 15: Logic diagram of a bit-serial multiplier cell, types (a) and (b). A master
multiplier for $l_m$-bit twiddle factors can be built by juxtaposing $l_m - 1$ cells of
type (a) with one cell of type (b).

Figure 16: Logic diagram of an initial offset generation circuit for the bit-serial multiplier
of Fig. 15. The signal IO is fed into the input PPS of the first multiplier cell.
The same circuit can be shared by all the multipliers.

Table 2: Control of the multipliers’ multiplexers (X = don’t care).

C0   C1   C2   R    Output   Note
0    0    0    0    Y        external twiddle factor
0    0    0    1    Q        stored twiddle factor
1    0    0    X    TF0      fixed value for 20-point DFT
X    0    1    X    TF1      fixed value for 60-point DFT ($W_0$)
X    1    0    X    TF2      fixed value for 60-point DFT ($W_1$)
X    1    1    X    TF3      fixed value for 60-point DFT ($W_2$)
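The selection rule of Table 2 can be sketched as a small behavioural model. This is only a reading aid, not the gate-level multiplexer of Fig. 15; the function name `select_twiddle_source` is hypothetical, and the priority given to C1/C2 over C0 is an assumption drawn from the don't-care entries of the table.

```python
def select_twiddle_source(c0, c1, c2, r):
    """Return the name of the twiddle factor source selected by Table 2.

    Behavioural sketch only: inputs are 0/1 integers, and the returned
    string is one of "Y", "Q", "TF0", "TF1", "TF2", "TF3".
    """
    if c1 or c2:
        # Rows with C0 = X and R = X: the pair (C1, C2) picks TF1..TF3.
        return {(0, 1): "TF1", (1, 0): "TF2", (1, 1): "TF3"}[(c1, c2)]
    if c0:
        # C0 = 1, C1 = C2 = 0, R = X: fixed value for the 20-point DFT.
        return "TF0"
    # C0 = C1 = C2 = 0: R chooses between the stored and external factor.
    return "Q" if r else "Y"


if __name__ == "__main__":
    for bits in [(0, 0, 0, 0), (0, 0, 0, 1), (1, 0, 0, 0), (0, 1, 1, 1)]:
        print(bits, select_twiddle_source(*bits))
```

All sixteen input combinations are covered, because every row with a don't-care entry expands to several combinations.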

The 20-point WFT circuit contains 48 multipliers that are evenly divided into
two groups. One group is fed with the real parts, and the other with the imaginary parts,
of 24 complex intermediate results. The fixed twiddle factors are now examined in further
detail, in an attempt to reduce the gate count of the multipliers. The values of the
twiddle factors in the 24 multipliers that are fed with real numbers can be calculated
using the equations:


v = -pi/2
u = -8*pi/5

multiplier  0 : 1
multiplier  1 : ((cos(u)+cos(2*u))/2-1)
multiplier  2 : ((cos(u)-cos(2*u))/2)
multiplier  3 : (sin(u)+sin(2*u))
multiplier  4 : sin(2*u)
multiplier  5 : (sin(u)-sin(2*u))
multiplier  6 : 1
multiplier  7 : ((cos(u)+cos(2*u))/2-1)
multiplier  8 : ((cos(u)-cos(2*u))/2)
multiplier  9 : (sin(u)+sin(2*u))
multiplier 10 : sin(2*u)
multiplier 11 : (sin(u)-sin(2*u))
multiplier 12 : 1
multiplier 13 : ((cos(u)+cos(2*u))/2-1)
multiplier 14 : ((cos(u)-cos(2*u))/2)
multiplier 15 : (sin(u)+sin(2*u))
multiplier 16 : sin(2*u)
multiplier 17 : (sin(u)-sin(2*u))
multiplier 18 : sin(v)
multiplier 19 : sin(v)*((cos(u)+cos(2*u))/2-1)
multiplier 20 : sin(v)*((cos(u)-cos(2*u))/2)
multiplier 21 : sin(v)*(sin(u)+sin(2*u))
multiplier 22 : sin(v)*sin(2*u)
multiplier 23 : sin(v)*(sin(u)-sin(2*u))

It is readily apparent that the twiddle factors of multipliers 0 through 5 are
identical to those of multipliers 6 through 11 and 12 through 17. Since six of the 24 values
appear three times each, these six values need only be stored once. One can therefore
use just 12 multipliers with twiddle factor storage (master multipliers), and complete this
set with 12 slave multipliers borrowing the twiddle factors of the first 12. The design of
a slave multiplier is straightforward, and the cells obtained are shown in Fig. 17. Now
considering the 24 multipliers fed with imaginary numbers, the situation is the same, the
only difference being a sign inversion in the twiddle factors of multipliers 3, 4, 5, 9, 10,
11, 15, 16, 17, 18, 19, and 20 (see footnote 16).
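The repetition among the 24 real-side twiddle factors can be checked numerically. The sketch below simply evaluates the expressions listed above with u = -8*pi/5 and v = -pi/2 in floating point; the names `base` and `twiddles` are illustrative, and no fixed-point quantization of the kind used in the actual multipliers is modelled.

```python
import math

u = -8 * math.pi / 5
v = -math.pi / 2

# The six expressions of multipliers 0 through 5.
base = [
    1.0,
    (math.cos(u) + math.cos(2 * u)) / 2 - 1,
    (math.cos(u) - math.cos(2 * u)) / 2,
    math.sin(u) + math.sin(2 * u),
    math.sin(2 * u),
    math.sin(u) - math.sin(2 * u),
]

# Multipliers 6-11 and 12-17 repeat the same six values;
# multipliers 18-23 scale them by sin(v).
twiddles = base * 3 + [math.sin(v) * t for t in base]

if __name__ == "__main__":
    for k, t in enumerate(twiddles):
        print(f"multiplier {k:2d} : {t:+.6f}")
```

Since sin(v) = -1 for v = -pi/2, the last six values are the negatives of the first six, which is why only 12 of the 24 multipliers need their own twiddle factor storage.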

In order to minimize the wiring length between the multipliers, every slave
multiplier should be placed close to its master multiplier. The order 0, 6, 12, 18, 1, 7, 13, 19,
2, 8, 14, 20, 3, 9, 15, 21, 4, 10, 16, 22, 5, 11, 17, and 23, where the master multipliers are
italicized, meets this constraint, and effectively places every master multiplier between its
two slaves.

Footnote 16: This is correct for a 20-point transformation (TF0). For a 60-point
transformation, this is also true for the sets TF1 and TF2; however, in TF3, it is the signs of
multipliers 0, 1, 2, 6, 7, 8, 12, 13, 14, 21, 22, and 23 that are inverted.


Figure 17: Logic diagram of a bit-serial multiplier cell without twiddle factor control. A
slave multiplier $l_m$ bits long is built by juxtaposing $l_m - 1$ cells of type (a) with
one cell of type (b).


4.5  POSITION  OF  THE  BINARY  POINT 


The position $p_o$ of the binary point in the Fourier coefficients output by the WFT circuit
is a function of its position in the input samples ($p_i$) and in the twiddle factors ($p_m$), and
of the number $L_s$ of layers whose scalers are active. Define the position of the binary
point as the number of binary places between the point and the least significant bit. For
example, the position of the binary point in “101.01” would be 2, whereas its position
in “llu.” would be −1. The position $p_o$ of the binary point in the Fourier coefficients
produced by the WFT circuit is given by:

$$p_o = p_i - (l_m - p_m - 1) - L_s . \tag{10}$$
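A minimal sketch of this relation follows, assuming Equation (10) has the form $p_o = p_i - (l_m - p_m - 1) - L_s$, where $L_s$ is the number of layers whose scalers are active. The function name and argument names are hypothetical. The check uses the values quoted for the simulation example of Section 5.0 ($p_i = 0$, $L_s = 0$, $l_m = 12$, $p_m = 9$), for which the text gives $p_o = -2$.

```python
def output_binary_point(p_i, l_m, p_m, active_scaling_layers=0):
    """Position of the binary point at the output, under the assumed form
    p_o = p_i - (l_m - p_m - 1) - L_s of Equation (10).

    p_i: binary point position in the input samples
    l_m: length of the twiddle factors, in bits
    p_m: binary point position in the twiddle factors
    active_scaling_layers: number of layers whose scalers are active (L_s)
    """
    return p_i - (l_m - p_m - 1) - active_scaling_layers


if __name__ == "__main__":
    p_o = output_binary_point(p_i=0, l_m=12, p_m=9, active_scaling_layers=0)
    # A negative position means the raw output must be multiplied by 2**(-p_o).
    print(p_o, 2 ** (-p_o))
```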

4.6  OPERATING  MODES  AND  CONTROL 

In this section, the WFT circuit is examined globally. Its four operating modes, which
are mutually exclusive, are described.

The first mode is a “discrete Fourier transformation” mode (DFT). This mode
allows computing 20-point DFTs using a single circuit, as depicted in Fig. 18. The real
and imaginary parts of the samples, which are respectively denoted by a and a', enter
the circuit at the top of the diagram and first traverse a tree of adders normally used
in 60-point transformations. Since zeros are applied to the other inputs of the tree, the
exiting samples are unchanged. The samples then enter the transformation modules
denoted by W and W'. The real and imaginary parts of the Fourier coefficients, which
are denoted by A and A', exit at the bottom of the circuit.

Figure 18: Block diagram of a WFT circuit computing 20-point transforms in its DFT
mode.

The mode DFT is also used for computing the intermediate results $M_0$, $M_1$, and
$M_2$ in 60-point transformations. A five-circuit configuration is shown in Fig. 19. In the
5-circuit configuration, the intermediate sets of results $M_0$, $M_1$, and $M_2$ enter two circuits
that are in an “addition” mode (ADD), yielding 40 of the 60 Fourier coefficients. The
remaining 20 Fourier coefficients are obtained by delaying the 20 intermediate results in
the set $M_0$ by two clock cycles.

Figure 19: Block diagram of a 5-circuit array computing 60-point DFTs. The top three
circuits are in the mode DFT. The bottom two are in the mode ADD. A sixth
circuit (FIFO) simply delays 20 Fourier coefficients by two clock cycles.

The third mode of operation is a straightforward “multiplier” mode (MUL). A
circuit in this mode provides direct input and output to 24 of its multipliers. Figure 20
shows the corresponding internal data path. On one side of the circuit, the 12 inputs x
are multiplied by the 12 inputs y to yield the 12 products z. The same computations take
place on the other side.

Figure 20: Block diagram of a WFT circuit in the mode MUL.

The  last  mode  is  a  “test”  mode  (TST)  for  validation  of  the  WFT  circuit  after 
fabrication.  This  mode  is  not  a  normal  mode  of  operation.  In  this  mode,  all  the  flip-flops 
in  the  circuit  become  connected  through  a  scan  chain.  Their  values  can  be  shifted  out  of 
the  circuit  and  replaced  by  new  values. 

The programmable adders/subtracters required for 60-point transformations are
distributed on two layers and controlled by a pair of signals specifying whether they must
“add and add” (AA), “add and subtract” (AS), “subtract and add” (SA), or “subtract and
subtract” (SS). Figure 19 shows the mode of each circuit (DFT or ADD), the signals
controlling the programmable adders/subtracters (AA, AS, or SA), and which twiddle factors
are being used.

The scalers in the two layers of programmable adders/subtracters and in the
nine layers of the 20-point transformation are controlled by a 3-bit signal that provides
eight different settings. Table 3 shows, for each of the settings, which layers are scaled
down and which are not. A “1” in the table indicates that scaling is enabled, a “0”
that it is disabled.

The internal data path of the WFT circuit is shown in Fig. 21. The picture,
which provides wire counts, may appear complicated at first. However, the routing is
straightforward and practical for current design tools and fabrication technology.

Figure 21: Internal data path of the WFT circuit.


Table 3: Scaling control.

Layer                                  Setting
                                       0  1  2  3  4  5  6  7
Adders/subtracters, layer 1            0  0  0  0  0  0  0  1
Adders/subtracters, layer 2            0  0  0  0  0  0  1  1
20-point, pre-multipliers, layer 1     0  0  0  0  0  1  1  1
20-point, pre-multipliers, layer 2     0  0  0  0  1  1  1  1
20-point, pre-multipliers, layer 3     0  0  0  1  1  1  1  1
20-point, pre-multipliers, layer 4     0  0  1  1  1  1  1  1
20-point, pre-multipliers, layer 5     0  1  1  1  1  1  1  1
20-point, post-multipliers, layer 6    0  0  0  0  0  0  0  1
20-point, post-multipliers, layer 7    0  0  0  0  0  0  1  1
20-point, post-multipliers, layer 8    0  0  0  0  0  1  1  1
20-point, post-multipliers, layer 9    0  0  0  0  1  1  1  1
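Table 3 can be read as a lookup from the 3-bit scaling setting to the set of scaled layers. The sketch below encodes the table rows as strings (one character per setting) and counts the active layers; the names `SCALING_TABLE` and `scaled_layers` are hypothetical and the encoding is chosen only for compactness.

```python
# One row per layer of Table 3; character k of each string is the entry
# for setting k ("1" = scaling enabled, "0" = scaling disabled).
SCALING_TABLE = {
    "adders/subtracters, layer 1":         "00000001",
    "adders/subtracters, layer 2":         "00000011",
    "20-point, pre-multipliers, layer 1":  "00000111",
    "20-point, pre-multipliers, layer 2":  "00001111",
    "20-point, pre-multipliers, layer 3":  "00011111",
    "20-point, pre-multipliers, layer 4":  "00111111",
    "20-point, pre-multipliers, layer 5":  "01111111",
    "20-point, post-multipliers, layer 6": "00000001",
    "20-point, post-multipliers, layer 7": "00000011",
    "20-point, post-multipliers, layer 8": "00000111",
    "20-point, post-multipliers, layer 9": "00001111",
}


def scaled_layers(setting):
    """Return the names of the layers whose scalers are active for a setting."""
    if setting not in range(8):
        raise ValueError("the scaling control is a 3-bit signal (0..7)")
    return [name for name, row in SCALING_TABLE.items() if row[setting] == "1"]


if __name__ == "__main__":
    for setting in range(8):
        print(setting, len(scaled_layers(setting)))
```

The number of active layers for each setting is the quantity that enters the binary point calculation of Section 4.5.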

4.7  GATE  COUNT 

Using the logic diagrams of all the cells presented in this section, a gate count for the
entire WFT circuit is now computed. The number of cells of each type is shown in Table 4.

By following the data progression in the circuit, one finds that there are 120
padding cells, one for each data input, and each cell has 13 gates. The data then
traverses 80 adders/subtracters containing 44 gates each, and enters the 20-point
transformation. 120 adders and 96 subtracters, containing 36 gates each, are required for that
transformation. Maintaining the synchronicity of the data path involves 204 hold-up cells
with scaling; 40 are in the two layers of programmable adders/subtracters, and 164 are
in the 20-point transformation modules. These cells are inexpensive at 13 gates each. A
55-gate overflow detection cell is required for each of the 11 layers of adders and
subtracters. The master and slave multipliers require 576 multiplier cells having complexities
ranging from 43 to 65 gates. They also require 360 hold-up cells containing 10 gates each.
Then 40 hold-up cells delay the output of the data with respect to the DVO and OFO
signals. For reconfiguring the data path to suit the various circuit modes, 40 2-to-1 and
24 3-to-1 multiplexers are required. After adding everything together, the WFT circuit
ends up containing approximately 55000 gates, and can therefore be implemented in a
moderately large gate array.

Table 4: Cell and gate counts for the WFT circuit.

Cell Name                         Count   Gates/Cell   Total Gates
padding cell                        120           13          1560
adder/subtracter                     80           44          3520
adder (with scaling)                120           36          4320
subtracter (with scaling)            96           36          3456
hold-up cell (with scaling)         204           13          2652
overflow detection cell              11           55           605
offset generation                     2           93           186
master mult. (stages 1-11)          264           65         17160
master mult. (stage 12)              24           63          1512
slave mult. (stages 1-11)           264           45         11880
slave mult. (stage 12)               24           43          1032
hold-up cell (without scaling)      400           10          4000
multiplexers 2:1 (data path)         40            4           160
multiplexers 3:1 (data path)         24            5           120
Gates in circuit                                              52163
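The total of Table 4 can be reproduced by multiplying each cell count by its gates per cell and summing. The sketch below does only that arithmetic; the list name `CELLS` is illustrative, and the 55 gates of the overflow detection cell are taken from the text of this section.

```python
# (cell name, count, gates per cell), transcribed from Table 4.
CELLS = [
    ("padding cell", 120, 13),
    ("adder/subtracter", 80, 44),
    ("adder (with scaling)", 120, 36),
    ("subtracter (with scaling)", 96, 36),
    ("hold-up cell (with scaling)", 204, 13),
    ("overflow detection cell", 11, 55),
    ("offset generation", 2, 93),
    ("master mult. (stages 1-11)", 264, 65),
    ("master mult. (stage 12)", 24, 63),
    ("slave mult. (stages 1-11)", 264, 45),
    ("slave mult. (stage 12)", 24, 43),
    ("hold-up cell (without scaling)", 400, 10),
    ("multiplexers 2:1 (data path)", 40, 4),
    ("multiplexers 3:1 (data path)", 24, 5),
]

total_gates = sum(count * gates for _, count, gates in CELLS)
multiplier_cells = sum(count for name, count, _ in CELLS if "mult." in name)

if __name__ == "__main__":
    for name, count, gates in CELLS:
        print(f"{name:32s} {count:5d} {gates:4d} {count * gates:7d}")
    print("Gates in circuit:", total_gates)
```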

5.0  LOGIC  SIMULATION 

The WFT circuit has been simulated at the logic level for verifying the correctness and
completeness of the design presented in Section 4.0 and measuring the effect of the
truncation errors. The simulations have been carried out on small computers (see footnote 17)
using software written in the MATLAB programming language [46]. The circuit has been
modeled at the gate level. Logic state transitions are synchronous, implying that propagation
delays are not taken into account. The size of the source code is about 65 Kbytes. Simulation of
the DFT mode takes about ten seconds per clock cycle. Computing the DFT of 20 15-bit
complex samples takes 45 clock cycles from the time the circuit is reset until the most
significant bit of the Fourier coefficients exits, and hence lasts for seven and a half
minutes.

An example of simulation is now presented. The simulator is fed with 20 complex
samples having real and imaginary 15-bit values picked at random from the discrete
interval [−2048, 2047] with all values being equiprobable. As explained in Section 4.5,
the binary point in the Fourier coefficients output by the circuit is shifted with respect to
its position at the input. According to Equation (10), with $p_i = 0$, $L_s = 0$ (no scaling),
$l_m = 12$ and $p_m = 9$, the point is at position $p_o = -2$ of the output. The output of the
WFT circuit must therefore be multiplied by four to yield the Fourier coefficients.

Footnote 17: Two workstations, a SUN SPARCStation 1 and a SUN IPC, have been used.

Figure 22: Plot of the Fourier coefficients computed by simulation of the WFT circuit for
a set of input samples chosen at random in the interval [−2048, 2047]. Both
the real (“o”) and the imaginary (“+”) parts of the coefficients are shown.

The Fourier coefficients obtained by simulation are shown in Fig. 22. Comparing
the values obtained by simulation with the theoretical values yields the errors shown in
Fig. 23. These errors result from the truncation errors of the twiddle factors stored in
the multipliers. These initial truncation errors grow as they combine in the
adders and subtracters that follow. The Fourier coefficients of our example end up having
only ten significant bits, whereas the input samples and the twiddle factors had twelve
significant bits. The reader is referred to the literature for more detailed error analysis of
Winograd’s algorithms [47]-[52].

In the previous example, the input samples were assumed to be error-free, i.e.
perfectly accurate. In practice, the input values themselves may be inaccurate, as a result
of quantization, for example. This may further reduce the number of significant bits in
the Fourier coefficients. Because the errors in the input samples combine in
the pre-multiplier additions, it would thus be a good idea to use more significant bits in
the samples than in the twiddle factors, i.e. to set $l_s > l_m$.

Figure 23: Plot of the differences, or errors, between the Fourier coefficients computed
by simulation of the WFT circuit and their theoretical values (real “o” and
imaginary “+” errors).

6.0  PRACTICAL  CONSIDERATIONS 

The  testability,  speed,  and  number  of  pins  of  the  WFT  circuit  are  examined  in  this  section. 
These  three  issues  are  important  in  practice. 

6.1  TESTABILITY 

To  verify  if  a  circuit  operates  according  to  its  specification,  a  comprehensive  set  of  circuit 
stimuli  with  the  expected  outputs  must  be  prepared.  These  are  called  test  vectors.  Test 
vectors  can  be  categorized  in  two  kinds. 

The first kind of test vectors is intended to guarantee that a device under test has
no fabrication defects, i.e. no faults, and operates as predicted from its fabrication masks
(see footnote 18). The fault coverage is generally measured by using a single “stuck at”
fault model [53]. A score of 95% with this model is usually considered sufficient for
prototyping. A 95% fault coverage means that one out of twenty defective devices having a
single fault can pass the test, and end up in the customer's hands.

The generation and validation of the first kind of test vectors for the WFT circuit
has been assigned to the circuit manufacturer (see footnote 19), which uses a generic testing
approach applicable to any design. This structured approach requires special circuitry and
increases the number of gates on the circuit by about 10 percent. The additional gates have
been included in the gate counts of Section 4.7.

Footnote 18: In practice, the generation and validation of test patterns is so difficult and
costly that manufacturers often accept the risks of shipping insufficiently tested devices
that may be defective.

The second kind of test vectors is aimed at verifying that the manufactured
devices, and the manufacturer’s circuit model, do what the designer expects. These
functional test vectors are meant to ensure that the circuit has no design flaws. They
can be used in gate-level circuit simulations, i.e. where propagation delays are taken
into account, and during fabrication, normally as a complement to the first kind of test
patterns.

Functional test vectors cannot be exhaustive because with modern circuit
densities the number of possible input combinations is too large. Therefore, the circuit designer
must prepare ad hoc functional test vectors aimed at detecting common design flaws.

It would be tedious to describe all the functional test vectors that have been
prepared for the WFT circuit. A quick overview of the various categories of test vectors
should be sufficient to illustrate the concept of functional testability. The eight categories
of test vectors are listed below, along with short descriptions:

1. Test adder: Compute maximum positive value plus zero, one, and minus one.
Compute minimum negative value plus zero, one, and minus one. Compute maximum
value plus minimum value.

2. Test subtracter: Same test as for the adder.

3. Test programmable adder/subtracter: Try the control signals (AA, AS, SA, and SS)
while in the mode ADD.

4. Test overflow: In the mode DFT, with scaling enabled and disabled, trigger overflows
in an adder using two positive and two negative numbers, and do the same in
a subtracter with two numbers with different signs. Repeat for pre- and
post-multiplier layers as well as for real and imaginary sides of circuit. Try to overflow
several adders per layer, and several layers at once.

5. Test twiddle factors: In the mode DFT, apply input vectors to circuit such that
every bit of every twiddle factor can be observed at the output. Repeat for the four
different settings: TF0, TF1, TF2, and TF3.

6. Test multiplier: For 15-bit long data, separated by 1- and 10-bit long padding,
multiply very small and very large values, testing all four possible sign combinations.
Make sure that at least one observable result would be different if the initial offset
circuitry didn’t work.

7. Test 20-point DFT: Input random complex data with mixed signs and verify Fourier
coefficients produced by circuit. Try a sinewave that produces an overflow and one
that does not. Repeat with scaling.

8. Test 60-point DFT: Apply to the circuit the inputs it would get if it were used five
times in succession to compute 60-point discrete Fourier transforms.

Footnote 19: The manufacturer is LSI Logic Co. of Canada.

Generating the functional test vectors using the logic simulator took several
weeks of CPU time on our computers. The functional test is by no means exhaustive,
but it should provide enough evidence for judging whether the circuits produced by the
manufacturer are functional or not.

6.2  SPEED 

The speed of a WFT circuit manufactured in a 0.7 µm CMOS gate array technology is
now discussed. Preliminary investigation of the circuit has indicated that its clock rate
will be limited by a signal path in the multipliers. Assuming that pipelining flip-flops are
inserted at every three stages of multiplier cells (see footnote 20), the critical path would
run through three full adders and three multiplexers. Electrical simulations indicate a
maximum clock rate of 30 MHz. This is an approximation, since statistical estimates of
layout-dependent parameters were used.

Assuming that the samples are $l_s = 15$ bits long, with one bit of padding
between the samples of two successive transforms, a 30 MHz WFT circuit could compute a
transform in 16 × 0.033 µs = 0.53 µs, i.e. compute over 1.8 million transforms per second.
This corresponds to throughputs of 37 million and 111 million samples per second for
20-point and 60-point transformations, respectively.
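The timing arithmetic above can be retraced with a few lines of code. This is a sketch under the stated assumptions only (30 MHz clock, 15-bit samples, one padding bit); the variable names are illustrative. Because it does not round the intermediate transform rate, it gives sample throughputs slightly above the rounded figures quoted in the text.

```python
CLOCK_HZ = 30e6          # estimated maximum clock rate
SAMPLE_BITS = 15         # sample length l_s
PADDING_BITS = 1         # padding between successive transforms

# In a bit-serial circuit, one transform is accepted every
# (sample length + padding) clock cycles.
cycles_per_transform = SAMPLE_BITS + PADDING_BITS
transform_time_s = cycles_per_transform / CLOCK_HZ
transforms_per_second = 1.0 / transform_time_s


def samples_per_second(points):
    """Complex samples processed per second for a transform of `points` points."""
    return points * transforms_per_second


if __name__ == "__main__":
    print(f"transform time : {transform_time_s * 1e6:.3f} us")
    print(f"transform rate : {transforms_per_second / 1e6:.3f} million/s")
    print(f"20-point       : {samples_per_second(20) / 1e6:.1f} Msamples/s")
    print(f"60-point       : {samples_per_second(60) / 1e6:.1f} Msamples/s")
```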

6.3  PIN  COUNT 

The pin requirement of the WFT circuit is given in Table 5. It turns out that 196 pins
are necessary, of which 164 are used for transferring data and indicating overflows, 10 for
control, 2 for testability, 1 for the external clock and 19 for power. No tri-state pads are
being used: all pins are either for input (134), output (43) or power (19). Packages with
196 pins are currently available.

Table 5: Input/output requirement of the WFT circuit.

Name                      Description                    Pins   in/out
Data in                   six groups of 20                120   in
Data out                  two groups of 20                 40   out
Mode                      mode: DFT, ADD, MUL, or TST       2   in
Adder/Subtr. control      AA, AS, SA, or SS                 2   in
Twiddle Factors control   TF0, TF1, TF2, or TF3             2   in
Scaling control           eight different settings          3   in
Circuit reset             RST                               1   in
Data Valid In             DVI                               1   in
Data Valid Out            DVO                               1   out
Overflow In               OFI                               1   in
Overflow Out              OFO                               1   out
Scan In                   for testability                   1   in
Scan Out                  for testability                   1   out
Clock                     CLK                               1   in
Power                     VDD, GND                         19   n.a.
Total                                                     196

Footnote 20: See Section 4.1 for an explanation of these flip-flops.
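The pin budget can be cross-checked by summing the rows of Table 5 by direction. The sketch below only transcribes the table; the names `PINS` and `pins_with_direction` are hypothetical.

```python
# (name, number of pins, direction), transcribed from Table 5.
PINS = [
    ("Data in", 120, "in"),
    ("Data out", 40, "out"),
    ("Mode", 2, "in"),
    ("Adder/Subtr. control", 2, "in"),
    ("Twiddle Factors control", 2, "in"),
    ("Scaling control", 3, "in"),
    ("Circuit reset", 1, "in"),
    ("Data Valid In", 1, "in"),
    ("Data Valid Out", 1, "out"),
    ("Overflow In", 1, "in"),
    ("Overflow Out", 1, "out"),
    ("Scan In", 1, "in"),
    ("Scan Out", 1, "out"),
    ("Clock", 1, "in"),
    ("Power", 19, "n.a."),
]


def pins_with_direction(direction):
    """Sum the pins of Table 5 that have the given direction."""
    return sum(count for _, count, d in PINS if d == direction)


total_pins = sum(count for _, count, _ in PINS)

if __name__ == "__main__":
    print("input :", pins_with_direction("in"))
    print("output:", pins_with_direction("out"))
    print("power :", pins_with_direction("n.a."))
    print("total :", total_pins)
```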

7.0  COST  COMPARISON 

In this section, the routed architecture is compared to the systolic architecture of Ward
et al. [27] and to a parallel FFT architecture from a cost standpoint. The unit measure
of cost is the logic gate, for lack of a better unit that would take into consideration the
routing areas. The routed WFTA and the parallel FFT architectures require routing,
whereas the systolic architecture does not.

The number of additions and multiplications in the WFTA can be derived from
Equations (8) and (9). The additions and multiplications in the FFT can be calculated
by [19]

$$A_{FFT} = (N/2)(-10 + 7\log_2 N) + 8 , \tag{11}$$

and

$$M_{FFT} = (N/2)(-10 + 3\log_2 N) + 8 . \tag{12}$$

The number of arithmetic operations being known, the gate count of each architecture
can be computed. Assume that each multiplier has 12 stages, and that each stage consists
of one multiplier cell containing 72 gates. Adders and subtracters cost 36 gates each. The
arrays in the systolic architecture contain $336 N M_W$ gates (see footnote 21). Denote by
$G_{FFT}$, $G_{routed}$, and $G_{systolic}$ the numbers of gates in the FFT, routed, and systolic
architectures, respectively. These gate counts can be calculated by the equations:

$$G_{FFT} = (12 \cdot 72 \cdot M_{FFT}) + (36 \cdot A_{FFT}) , \tag{13}$$

$$G_{routed} = (12 \cdot 72 \cdot M_W) + (36 \cdot A_W) , \tag{14}$$

$$G_{systolic} = (12 \cdot 72 \cdot M_W) + (336 \cdot N \cdot M_W) . \tag{15}$$
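Equations (11) through (15) translate directly into code. The sketch below is a plain transcription with hypothetical function names; the WFTA operation counts $M_W$ and $A_W$ are passed in as arguments because they come from Equations (8) and (9), which are not reproduced in this section.

```python
import math

STAGES_PER_MULTIPLIER = 12
GATES_PER_MULTIPLIER_CELL = 72
GATES_PER_ADDER = 36
GATES_PER_MULTIPLIER = STAGES_PER_MULTIPLIER * GATES_PER_MULTIPLIER_CELL


def fft_additions(n):
    """Equation (11): additions in an n-point FFT."""
    return (n / 2) * (-10 + 7 * math.log2(n)) + 8


def fft_multiplications(n):
    """Equation (12): multiplications in an n-point FFT."""
    return (n / 2) * (-10 + 3 * math.log2(n)) + 8


def gates_fft(n):
    """Equation (13): gate count of the parallel FFT architecture."""
    return GATES_PER_MULTIPLIER * fft_multiplications(n) + GATES_PER_ADDER * fft_additions(n)


def gates_routed(m_w, a_w):
    """Equation (14): gate count of the routed WFTA architecture."""
    return GATES_PER_MULTIPLIER * m_w + GATES_PER_ADDER * a_w


def gates_systolic(n, m_w):
    """Equation (15): gate count of the systolic WFTA architecture."""
    return GATES_PER_MULTIPLIER * m_w + 336 * n * m_w


if __name__ == "__main__":
    for n in (16, 32, 64):
        print(n, fft_multiplications(n), fft_additions(n), gates_fft(n))
```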

The costs of the three architectures are shown in Fig. 24 as a function of the number
of points N. Of the three, the routed architecture appears to be the least expensive.
The FFT architecture contains two to three times more gates than the routed WFTA for
comparable transform sizes. The systolic architecture is much more expensive than the
two others.

Figure 24: Plot of the number of gates in the FFT, routed, and systolic architectures,
as a function of the number of points N.

Footnote 21: See Section 2.0 for an analysis of the systolic architecture cost.

8.0 CONCLUSION


In this report, a routed architecture for the Winograd Fourier transform algorithms
(WFTA) has been presented. This bit-serial architecture maps an N-point WFTA
directly onto a VLSI circuit. The resultant layout exhibits little regularity among the
adders, but it covers a small area and can be generated by computer-aided design tools.
The nesting method invented by Winograd has been proposed as a means of partitioning
a large transformation into several pieces implemented on individual circuits. One
advantage of this partitioning approach is that it minimizes the number of multipliers, which
are very expensive. Another advantage is that the nested circuits can all be of the same
type. This reduces the design time and the number of mask sets. The main disadvantage
is that it requires more input/output pins than some other approaches.

The logic design of a 20-point Winograd Fourier transformation (WFT) circuit
has been presented in detail. Data formats, which have an impact on the output
accuracy and design cost, have been examined. Floorplans for the pre- and post-multiplier
additions have been proposed, along with logic diagrams for the adding and subtracting
cells. Overflow detection has been included in the design. Low cost multipliers for two’s
complement input and output data have been designed. The circuit can be programmed
to compute either 20-point DFTs by itself, or 60-point DFTs when it is connected to four
other circuits. The circuit contains about 55000 gates and has 196 pins. The whole design
has been simulated on a computer to verify the correctness of its logic and measure the
accuracy of the output data. A comprehensive set of test vectors has been designed for
verifying the functionality of the circuit samples.

Overall, the routed architecture appears to be attractive for computing
moderate size DFTs at very high speeds. The routed architecture can also be combined with
partitioning techniques like the prime factor algorithm for computing larger DFTs. The
possible applications include electronic warfare, image, radar, speech, and sonar
processing.

A.0 DERIVATION OF WINOGRAD FOURIER TRANSFORM
ALGORITHMS FOR 20 POINTS AND 60 POINTS

In this section we derive Winograd Fourier transformation algorithms (WFTA) for 20 and
60 points. We assume that the reader has a basic understanding of the nesting method
of Winograd [2].

A.l  DERIVATION  OF  A  20-POINT  ALGORITHM 

Suppose that one wants to derive an algorithm for computing N-point DFTs. Let
$N = N_1 N_2$, where $N_1$ and $N_2$ are relatively prime. Assume that an algorithm is known for
computing $N_1$-point DFTs using $A_1$ additions and $M_1$ multiplications, and that another
algorithm is known for computing $N_2$-point DFTs using $A_2$ additions and $M_2$
multiplications. Then the $N_1$-point and $N_2$-point algorithms can be “nested” to yield an N-point
algorithm requiring $M_1 M_2$ multiplications and $N_2 A_1 + M_1 A_2$ additions.

A 20-point algorithm is now derived from the 4- and 5-point algorithms. Let
$N = 20$, $N_1 = 4$, and $N_2 = 5$. The resultant 20-point algorithm contains

$M = M_1 M_2 = 4 \cdot 6 = 24$ complex multiplications,

and

$A = N_2 A_1 + M_1 A_2 = 5 \cdot 8 + 4 \cdot 17 = 108$ complex additions.

With $N_1 = 5$ and $N_2 = 4$, it would contain instead

$A' = N_2 A_1 + M_1 A_2 = 4 \cdot 17 + 6 \cdot 8 = 116$ complex additions

and be more expensive.
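The operation counts of the nested algorithm can be verified with a short function. This is a sketch with a hypothetical name, `nested_counts`, using the 4-point (4 multiplications, 8 additions) and 5-point (6 multiplications, 17 additions) counts that appear in the expressions above.

```python
def nested_counts(n1, m1, a1, n2, m2, a2):
    """Operation counts of an (n1 * n2)-point nested algorithm.

    The outer algorithm has n1 points, m1 multiplications and a1 additions;
    the inner one has n2 points, m2 multiplications and a2 additions.
    Returns (multiplications, additions) = (m1 * m2, n2 * a1 + m1 * a2).
    """
    return m1 * m2, n2 * a1 + m1 * a2


FOUR_POINT = (4, 4, 8)    # (points, multiplications, additions)
FIVE_POINT = (5, 6, 17)

if __name__ == "__main__":
    print("N1 = 4, N2 = 5:", nested_counts(*FOUR_POINT, *FIVE_POINT))
    print("N1 = 5, N2 = 4:", nested_counts(*FIVE_POINT, *FOUR_POINT))
```

Both orderings need the same number of multiplications; only the addition count depends on which algorithm is taken as the outer one.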

By the Chinese Remainder Theorem (CRT), every integer $0 \le n \le N - 1$ can
be represented by the pair $(n_1, n_2)$ such that $n_1 = n \bmod N_1$ and $n_2 = n \bmod N_2$. Taking
$20 = 4 \cdot 5$, we get the mapping:

 0 -> (0,0)     5 -> (1,0)    10 -> (2,0)    15 -> (3,0)
 1 -> (1,1)     6 -> (2,1)    11 -> (3,1)    16 -> (0,1)
 2 -> (2,2)     7 -> (3,2)    12 -> (0,2)    17 -> (1,2)
 3 -> (3,3)     8 -> (0,3)    13 -> (1,3)    18 -> (2,3)
 4 -> (0,4)     9 -> (1,4)    14 -> (2,4)    19 -> (3,4)

which put into lexicographical order yields: 0, 16, 12, 8, 4, 5, 1, 17, 13, 9, 10, 6, 2, 18, 14,
15, 11, 7, 3, and 19.
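The mapping and the resulting ordering can be checked mechanically. The sketch below assumes nothing beyond the definition $n \mapsto (n \bmod N_1, n \bmod N_2)$; the helper names `crt_pairs` and `lexicographic_order` are hypothetical.

```python
def crt_pairs(n1_len, n2_len):
    """Map every n in [0, N1*N2) to the pair (n mod N1, n mod N2)."""
    return {n: (n % n1_len, n % n2_len) for n in range(n1_len * n2_len)}

def lexicographic_order(n1_len, n2_len):
    """Indices n sorted by their pair (n1, n2), n1 most significant."""
    pairs = crt_pairs(n1_len, n2_len)
    return sorted(pairs, key=pairs.get)

mapping20 = crt_pairs(4, 5)
order20 = lexicographic_order(4, 5)
print(order20)
```

Because 4 and 5 are relatively prime, every pair occurs exactly once, so the sort is unambiguous.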

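Let us define

$$\begin{aligned}
\mathbf{a}_0 &= (a_0\ a_{16}\ a_{12}\ a_8\ a_4) & \mathbf{A}_0 &= (A_0\ A_{16}\ A_{12}\ A_8\ A_4) \\
\mathbf{a}_1 &= (a_5\ a_1\ a_{17}\ a_{13}\ a_9) & \mathbf{A}_1 &= (A_5\ A_1\ A_{17}\ A_{13}\ A_9) \\
\mathbf{a}_2 &= (a_{10}\ a_6\ a_2\ a_{18}\ a_{14}) & \mathbf{A}_2 &= (A_{10}\ A_6\ A_2\ A_{18}\ A_{14}) \\
\mathbf{a}_3 &= (a_{15}\ a_{11}\ a_7\ a_3\ a_{19}) & \mathbf{A}_3 &= (A_{15}\ A_{11}\ A_7\ A_3\ A_{19})
\end{aligned}$$

and apply the following 4-point algorithm [2] to these vectors:

$$\begin{aligned}
\mathbf{S}_1 &= \mathbf{a}_0 + \mathbf{a}_2 & \mathbf{S}_2 &= \mathbf{a}_0 - \mathbf{a}_2 \\
\mathbf{S}_3 &= \mathbf{a}_1 + \mathbf{a}_3 & \mathbf{S}_4 &= \mathbf{a}_1 - \mathbf{a}_3 \\
\mathbf{S}_5 &= \mathbf{S}_1 + \mathbf{S}_3 & \mathbf{S}_6 &= \mathbf{S}_1 - \mathbf{S}_3 \\
\mathbf{M}_1 &= \mathbf{W} \cdot \mathbf{S}_5 & \mathbf{M}_2 &= \mathbf{W} \cdot \mathbf{S}_6 \\
\mathbf{M}_3 &= \mathbf{W} \cdot \mathbf{S}_2 & \mathbf{M}_4 &= j \sin v \; \mathbf{W} \cdot \mathbf{S}_4 \\
\mathbf{S}_7 &= \mathbf{M}_3 + \mathbf{M}_4 & \mathbf{S}_8 &= \mathbf{M}_3 - \mathbf{M}_4 \\
\mathbf{A}_0 &= \mathbf{M}_1 & \mathbf{A}_1 &= \mathbf{S}_7 \\
\mathbf{A}_2 &= \mathbf{M}_2 & \mathbf{A}_3 &= \mathbf{S}_8
\end{aligned}$$

where $\mathbf{W}$ denotes the 5-point transformation given in Section 2.2, and $v = -\frac{2\pi}{4}$. A 20-point
Winograd Fourier transformation algorithm is obtained.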

The  remaining  step  consists  of  computing  the  values  of  u  and  v  which  appear 
in  the  twiddle  factors  of  the  algorithm  (u  comes  from  the  5-point,  v  from  the  4-point). 
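Beginning with $u$, the value of $b$ must be calculated using the equation [2]:

$$\left(e^{-j 2\pi / N}\right)^{n} = \left(e^{-j 2\pi / N_2}\right)^{b} \tag{A.1}$$

where $n$ is the integer that the CRT associates with the pair $(0, 1)$. The CRT indicates
that $(0, 1)$ corresponds to 16, thus

$$\left(e^{-j 2\pi / 20}\right)^{16} = \left(e^{-j 2\pi / 5}\right)^{b} \tag{A.2}$$

and

$$b = 4. \tag{A.3}$$

The value of $u$ in the $N$-point algorithm is equal to $b$ times its value in the $N_2$-point
algorithm ($-\frac{2\pi}{5}$):

$$u = 4 \left(-\frac{2\pi}{5}\right) = -\frac{8\pi}{5}. \tag{A.4}$$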

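To obtain $v$, the value of $a$ must be computed using the equation [2]:

$$\left(e^{-j 2\pi / N}\right)^{n} = \left(e^{-j 2\pi / N_1}\right)^{a} \tag{A.5}$$

where $n$ is now the integer that the CRT associates with the pair $(1, 0)$. The CRT maps
$(1, 0)$ into 5, thus

$$\left(e^{-j 2\pi / 20}\right)^{5} = \left(e^{-j 2\pi / 4}\right)^{a} \tag{A.6}$$

and

$$a = 1. \tag{A.7}$$

Consequently, the value of $v$ is unchanged:

$$v = 1 \cdot \left(-\frac{2\pi}{4}\right) = -\frac{\pi}{2}. \tag{A.8}$$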
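The multipliers $a$ and $b$ can also be found by a direct search. The sketch below assumes the reading of the equations used here: two powers of roots of unity are equal exactly when their exponents, taken as fractions of a full turn, differ by an integer. The function names are hypothetical.

```python
def crt_index(r1, r2, n1, n2):
    """Index n in [0, n1*n2) with n mod n1 == r1 and n mod n2 == r2."""
    return next(n for n in range(n1 * n2) if n % n1 == r1 and n % n2 == r2)

def angle_multipliers(n1, n2):
    """Return (a, b): multipliers of the N1-point and N2-point twiddle angles."""
    n = n1 * n2
    idx_10 = crt_index(1, 0, n1, n2)   # index paired with (1, 0)
    idx_01 = crt_index(0, 1, n1, n2)   # index paired with (0, 1)
    # idx/N and a/N1 describe the same angle when idx == a*N2 (mod N)
    a = next(k for k in range(n1) if (idx_10 - k * n2) % n == 0)
    # idx/N and b/N2 describe the same angle when idx == b*N1 (mod N)
    b = next(k for k in range(n2) if (idx_01 - k * n1) % n == 0)
    return a, b

print(angle_multipliers(4, 5))   # (1, 4)
```

For $N_1 = 4$ and $N_2 = 5$ this returns $a = 1$ and $b = 4$, in agreement with (A.3) and (A.7).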

This completes the derivation of the 20-point WFTA. The resultant algorithm is
now simply stated in the same raw format that was input to our simulator for validation.
The additions and multiplications are all on complex data.
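As a cross-check of the derivation, the nesting can be exercised numerically. The following is a minimal sketch, not the simulator mentioned above: it nests a 5-point Winograd module inside the 4-point algorithm using $u = -8\pi/5$ and $v = -\pi/2$, applies the CRT index map to inputs and outputs, and compares the result with a direct DFT. All function names are illustrative.

```python
import cmath
import math

U = -8 * math.pi / 5   # 5-point angle after scaling by b = 4
V = -math.pi / 2       # 4-point angle (a = 1)

def crt_index(r1, r2, n1=4, n2=5):
    """Index n in [0, n1*n2) with n mod n1 == r1 and n mod n2 == r2."""
    return next(n for n in range(n1 * n2) if n % n1 == r1 and n % n2 == r2)

def wfta5(x, u):
    """5-point Winograd module: 17 additions, 6 multiplications (m0 is trivial)."""
    c1, c2, s1, s2 = math.cos(u), math.cos(2 * u), math.sin(u), math.sin(2 * u)
    t1, t2 = x[1] + x[4], x[1] - x[4]
    t3, t4 = x[3] + x[2], x[3] - x[2]
    t5, t6, t7 = t1 + t3, t1 - t3, t2 + t4
    m0 = t5 + x[0]
    m1 = ((c1 + c2) / 2 - 1) * t5
    m2 = ((c1 - c2) / 2) * t6
    m3 = 1j * (s1 + s2) * t2
    m4 = 1j * s2 * t7
    m5 = 1j * (s1 - s2) * t4
    t9 = m0 + m1
    t10, t11 = t9 + m2, t9 - m2
    t12, t13 = m3 - m4, m4 + m5
    return [m0, t10 + t12, t11 + t13, t11 - t13, t10 - t12]

def add(p, q):
    return [x + y for x, y in zip(p, q)]

def sub(p, q):
    return [x - y for x, y in zip(p, q)]

def wfta20(a):
    """4-point algorithm applied to four 5-element vectors (nested WFTA)."""
    vec = [[a[crt_index(r1, r2)] for r2 in range(5)] for r1 in range(4)]
    s1, s2 = add(vec[0], vec[2]), sub(vec[0], vec[2])
    s3, s4 = add(vec[1], vec[3]), sub(vec[1], vec[3])
    s5, s6 = add(s1, s3), sub(s1, s3)
    m1, m2, m3 = wfta5(s5, U), wfta5(s6, U), wfta5(s2, U)
    m4 = [1j * math.sin(V) * y for y in wfta5(s4, U)]
    out_vec = [m1, add(m3, m4), m2, sub(m3, m4)]
    out = [0j] * 20
    for r1 in range(4):
        for r2 in range(5):
            out[crt_index(r1, r2)] = out_vec[r1][r2]
    return out

def direct_dft(a):
    """Reference O(N^2) DFT with kernel exp(-j*2*pi*n*k/N)."""
    n = len(a)
    return [sum(a[i] * cmath.exp(-2j * math.pi * i * k / n) for i in range(n))
            for k in range(n)]

sample = [complex(i % 3, i % 7) for i in range(20)]
max_error = max(abs(p - q) for p, q in zip(wfta20(sample), direct_dft(sample)))
print(max_error)
```

The difference should be at the level of floating-point round-off. Running the sketch with $u = -2\pi/5$ instead is a quick way to see why the scaling by $b$ is needed.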
A.2 20-POINT ALGORITHM

v = -pi/2
u = -8*pi/5

s1 = a0+a10      s6 = a0-a10      s11 = a5+a15
s2 = a16+a6      s7 = a16-a6      s12 = a1+a11
s3 = a12+a2      s8 = a12-a2      s13 = a17+a7
s4 = a8+a18      s9 = a8-a18      s14 = a13+a3
s5 = a4+a14      s10 = a4-a14     s15 = a9+a19
s16 = a5-a15     s21 = s1+s11     s26 = s1-s11
s17 = a1-a11     s22 = s2+s12     s27 = s2-s12
s18 = a17-a7     s23 = s3+s13     s28 = s3-s13
s19 = a13-a3     s24 = s4+s14     s29 = s4-s14
s20 = a9-a19     s25 = s5+s15     s30 = s5-s15

s31 = s22+s25    s32 = s22-s25    s33 = s24+s23
s34 = s24-s23    s35 = s31+s33    s36 = s31-s33
s37 = s32+s34    s38 = s35+s21

s39 = s27+s30    s40 = s27-s30    s41 = s29+s28
s42 = s29-s28    s43 = s39+s41    s44 = s39-s41
s45 = s40+s42    s46 = s43+s26

s47 = s7+s10     s48 = s7-s10     s49 = s9+s8
s50 = s9-s8      s51 = s47+s49    s52 = s47-s49
s53 = s48+s50    s54 = s51+s6

s55 = s17+s20    s56 = s17-s20    s57 = s19+s18
s58 = s19-s18    s59 = s55+s57    s60 = s55-s57
s61 = s56+s58    s62 = s59+s16

m0 = s38
m1 = ((cos(u)+cos(2*u))/2-1)*s35
m2 = ((cos(u)-cos(2*u))/2)*s36
m3 = j*(sin(u)+sin(2*u))*s32
m4 = j*sin(2*u)*s37
m5 = j*(sin(u)-sin(2*u))*s34
m6 = s46
m7 = ((cos(u)+cos(2*u))/2-1)*s43
m8 = ((cos(u)-cos(2*u))/2)*s44
m9 = j*(sin(u)+sin(2*u))*s40
m10 = j*sin(2*u)*s45
m11 = j*(sin(u)-sin(2*u))*s42
m12 = s54
m13 = ((cos(u)+cos(2*u))/2-1)*s51
m14 = ((cos(u)-cos(2*u))/2)*s52
m15 = j*(sin(u)+sin(2*u))*s48
m16 = j*sin(2*u)*s53
m17 = j*(sin(u)-sin(2*u))*s50
m18 = j*sin(v)*s62
m19 = j*sin(v)*((cos(u)+cos(2*u))/2-1)*s59
m20 = j*sin(v)*((cos(u)-cos(2*u))/2)*s60
m21 = j*sin(v)*j*(sin(u)+sin(2*u))*s56
m22 = j*sin(v)*j*sin(2*u)*s61
m23 = j*sin(v)*j*(sin(u)-sin(2*u))*s58

s63 = m0+m1      s64 = s63+m2     s65 = s63-m2
s66 = m3-m4      s67 = m4+m5      s68 = s64+s66
s69 = s64-s66    s70 = s65+s67    s71 = s65-s67

s72 = m6+m7      s73 = s72+m8     s74 = s72-m8
s75 = m9-m10     s76 = m10+m11    s77 = s73+s75
s78 = s73-s75    s79 = s74+s76    s80 = s74-s76

s81 = m12+m13    s82 = s81+m14    s83 = s81-m14
s84 = m15-m16    s85 = m16+m17    s86 = s82+s84
s87 = s82-s84    s88 = s83+s85    s89 = s83-s85

s90 = m18+m19    s91 = s90+m20    s92 = s90-m20
s93 = m21-m22    s94 = m22+m23    s95 = s91+s93
s96 = s91-s93    s97 = s92+s94    s98 = s92-s94

s99 = m12+m18    s104 = m12-m18
s100 = s86+s95   s105 = s86-s95
s101 = s88+s97   s106 = s88-s97
s102 = s89+s98   s107 = s89-s98
s103 = s87+s96   s108 = s87-s96

A0 = m0      A16 = s68    A12 = s70    A8 = s71    A4 = s69

A5 = s99     A10 = m6     A15 = s104
A1 = s100    A6 = s77     A11 = s105
A17 = s101   A2 = s79     A7 = s106
A13 = s102   A18 = s80    A3 = s107
A9 = s103    A14 = s78    A19 = s108



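A.3 DERIVATION OF A 60-POINT ALGORITHM

With the help of the 20-point algorithm, we now derive an algorithm for $60 = 3 \cdot 20$ points
which contains $M = 3 \cdot 24 = 72$ complex multiplications and $A = 20 \cdot 6 + 3 \cdot 108 = 444$
complex additions. The procedure is exactly the same as for the 20-point algorithm.
Note that the factors 3, 4, and 5 are mutually prime. If they were not, the Winograd
nesting method could not be used.

Using the Chinese Remainder Theorem, the inputs and outputs are reordered as
follows: 0, 21, 42, 3, 24, 45, 6, 27, 48, 9, 30, 51, 12, 33, 54, 15, 36, 57, 18, 39, 40, 1, 22,
43, 4, 25, 46, 7, 28, 49, 10, 31, 52, 13, 34, 55, 16, 37, 58, 19, 20, 41, 2, 23, 44, 5, 26, 47, 8,
29, 50, 11, 32, 53, 14, 35, 56, 17, 38, 59. Let us define the vectors

$$\begin{aligned}
\mathbf{a}'_0 &= (a_0\ a_{21}\ a_{42}\ a_3\ a_{24}\ a_{45}\ a_6\ a_{27}\ a_{48}\ a_9\ a_{30}\ a_{51}\ a_{12}\ a_{33}\ a_{54}\ a_{15}\ a_{36}\ a_{57}\ a_{18}\ a_{39}) \\
\mathbf{a}'_1 &= (a_{40}\ a_1\ a_{22}\ a_{43}\ a_4\ a_{25}\ a_{46}\ a_7\ a_{28}\ a_{49}\ a_{10}\ a_{31}\ a_{52}\ a_{13}\ a_{34}\ a_{55}\ a_{16}\ a_{37}\ a_{58}\ a_{19}) \\
\mathbf{a}'_2 &= (a_{20}\ a_{41}\ a_2\ a_{23}\ a_{44}\ a_5\ a_{26}\ a_{47}\ a_8\ a_{29}\ a_{50}\ a_{11}\ a_{32}\ a_{53}\ a_{14}\ a_{35}\ a_{56}\ a_{17}\ a_{38}\ a_{59}) \\
\mathbf{A}'_0 &= (A_0\ A_{21}\ A_{42}\ A_3\ A_{24}\ A_{45}\ A_6\ A_{27}\ A_{48}\ A_9\ A_{30}\ A_{51}\ A_{12}\ A_{33}\ A_{54}\ A_{15}\ A_{36}\ A_{57}\ A_{18}\ A_{39}) \\
\mathbf{A}'_1 &= (A_{40}\ A_1\ A_{22}\ A_{43}\ A_4\ A_{25}\ A_{46}\ A_7\ A_{28}\ A_{49}\ A_{10}\ A_{31}\ A_{52}\ A_{13}\ A_{34}\ A_{55}\ A_{16}\ A_{37}\ A_{58}\ A_{19}) \\
\mathbf{A}'_2 &= (A_{20}\ A_{41}\ A_2\ A_{23}\ A_{44}\ A_5\ A_{26}\ A_{47}\ A_8\ A_{29}\ A_{50}\ A_{11}\ A_{32}\ A_{53}\ A_{14}\ A_{35}\ A_{56}\ A_{17}\ A_{38}\ A_{59})
\end{aligned}$$

and apply to these vectors the 3-point algorithm [2]

$$\begin{aligned}
\mathbf{S}_1 &= \mathbf{a}'_1 + \mathbf{a}'_2 & \mathbf{S}_2 &= \mathbf{a}'_1 - \mathbf{a}'_2 & \mathbf{S}_3 &= \mathbf{S}_1 + \mathbf{a}'_0 \\
\mathbf{M}_0 &= \mathbf{W} \cdot \mathbf{S}_3 & \mathbf{M}_1 &= (\cos w - 1)\, \mathbf{W} \cdot \mathbf{S}_1 & \mathbf{M}_2 &= j \sin w \; \mathbf{W} \cdot \mathbf{S}_2 \\
\mathbf{S}_4 &= \mathbf{M}_0 + \mathbf{M}_1 & \mathbf{S}_5 &= \mathbf{S}_4 + \mathbf{M}_2 & \mathbf{S}_6 &= \mathbf{S}_4 - \mathbf{M}_2 \\
\mathbf{A}'_0 &= \mathbf{M}_0 & \mathbf{A}'_1 &= \mathbf{S}_5 & \mathbf{A}'_2 &= \mathbf{S}_6
\end{aligned}$$

where this time $\mathbf{W}$ denotes a 20-point transformation and $w = -\frac{2\pi}{3}$. A 60-point Winograd
Fourier transformation algorithm is obtained.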
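The 3-point module used above can be checked on scalars before it is applied to vectors. The sketch below assumes the module computes $X_k = \sum_n x_n e^{jwnk}$ for an angle $w$ that is a multiple of $2\pi/3$; it tests $w = -2\pi/3$ and twice that angle. The names `wfta3` and `kernel_dft3` are illustrative.

```python
import cmath
import math

def wfta3(x, w):
    """3-point Winograd module: 6 additions, 3 multiplications (m0 is trivial)."""
    s1, s2 = x[1] + x[2], x[1] - x[2]
    s3 = s1 + x[0]
    m0 = s3
    m1 = (math.cos(w) - 1) * s1
    m2 = 1j * math.sin(w) * s2
    s4 = m0 + m1
    return [m0, s4 + m2, s4 - m2]

def kernel_dft3(x, w):
    """Reference: X_k = sum_n x_n * exp(j*w*n*k)."""
    return [sum(x[n] * cmath.exp(1j * w * n * k) for n in range(3))
            for k in range(3)]

x = [1 + 2j, -0.5 + 0j, 3 - 1j]
errors = {}
for w in (-2 * math.pi / 3, -4 * math.pi / 3):
    errors[w] = max(abs(p - q) for p, q in zip(wfta3(x, w), kernel_dft3(x, w)))
print(errors)
```

In the nested algorithm the same six additions act on 20-element vectors, which accounts for the term $20 \cdot 6$ in the addition count.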

The new values of $u$, $v$, and $w$ are easily found. Since $(0, 1)$ corresponds to 21,
$b$ is equal to 7, and hence

$$u = 7 \left(-\frac{8\pi}{5}\right) = -\frac{56\pi}{5}$$

and

$$v = 7 \left(-\frac{\pi}{2}\right) = -\frac{7\pi}{2}.$$

Similarly, $(1, 0)$ corresponds to 40, $a$ is equal to 2, and the value of $w$ in the 60-point
algorithm is:

$$w = 2 \left(-\frac{2\pi}{3}\right) = -\frac{4\pi}{3}.$$
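The scaled angles can be verified with exact rational arithmetic, working in units of $\pi$. This is a small sketch; the function name and its default arguments are illustrative assumptions that simply encode the values stated in the text.

```python
from fractions import Fraction

def scaled_angles(b, a, u20=Fraction(-8, 5), v20=Fraction(-1, 2), w3=Fraction(-2, 3)):
    """Angles in units of pi: the 20-point u and v are scaled by b, the 3-point w by a."""
    return b * u20, b * v20, a * w3

u60, v60, w60 = scaled_angles(b=7, a=2)
# Reduce modulo 2*pi (modulo 2 in units of pi) to get equivalent angles in [0, 2*pi).
principal = tuple(x % 2 for x in (u60, v60, w60))
print(u60, v60, w60)
print(principal)
```

The reduced values show that the 60-point twiddle factors are built from the angles $4\pi/5$, $\pi/2$, and $2\pi/3$, up to whole turns.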

The resultant 60-point Winograd Fourier transform algorithm is given below.
Again, all the additions and multiplications are on complex data.


A.4 60-POINT ALGORITHM

u = -(56/5)*pi
v = -(7/2)*pi
w = -(4/3)*pi

s0 = a40+a20     s20 = a40-a20    s40 = s0+a0
s1 = a1+a41      s21 = a1-a41     s41 = s1+a21
s2 = a22+a2      s22 = a22-a2     s42 = s2+a42
s3 = a43+a23     s23 = a43-a23    s43 = s3+a3
s4 = a4+a44      s24 = a4-a44     s44 = s4+a24
s5 = a25+a5      s25 = a25-a5     s45 = s5+a45
s6 = a46+a26     s26 = a46-a26    s46 = s6+a6
s7 = a7+a47      s27 = a7-a47     s47 = s7+a27
s8 = a28+a8      s28 = a28-a8     s48 = s8+a48
s9 = a49+a29     s29 = a49-a29    s49 = s9+a9
s10 = a10+a50    s30 = a10-a50    s50 = s10+a30
s11 = a31+a11    s31 = a31-a11    s51 = s11+a51
s12 = a52+a32    s32 = a52-a32    s52 = s12+a12
s13 = a13+a53    s33 = a13-a53    s53 = s13+a33
s14 = a34+a14    s34 = a34-a14    s54 = s14+a54
s15 = a55+a35    s35 = a55-a35    s55 = s15+a15
s16 = a16+a56    s36 = a16-a56    s56 = s16+a36
s17 = a37+a17    s37 = a37-a17    s57 = s17+a57
s18 = a58+a38    s38 = a58-a38    s58 = s18+a18
s19 = a19+a59    s39 = a19-a59    s59 = s19+a39

s61 = s40+s50    s66 = s40-s50    s71 = s45+s55
s62 = s56+s46    s67 = s56-s46    s72 = s41+s51
s63 = s52+s42    s68 = s52-s42    s73 = s57+s47
s64 = s48+s58    s69 = s48-s58    s74 = s53+s43
s65 = s44+s54    s70 = s44-s54    s75 = s49+s59
s76 = s45-s55    s81 = s61+s71    s86 = s61-s71
s77 = s41-s51    s82 = s62+s72    s87 = s62-s72
s78 = s57-s47    s83 = s63+s73    s88 = s63-s73
s79 = s53-s43    s84 = s64+s74    s89 = s64-s74
s80 = s49-s59    s85 = s65+s75    s90 = s65-s75

s91 = s82+s85    s92 = s82-s85    s93 = s84+s83
s94 = s84-s83    s95 = s91+s93    s96 = s91-s93
s97 = s92+s94    s98 = s95+s81

s99 = s87+s90    s100 = s87-s90   s101 = s89+s88
s102 = s89-s88   s103 = s99+s101  s104 = s99-s101
s105 = s100+s102 s106 = s103+s86

s107 = s67+s70   s108 = s67-s70   s109 = s69+s68
s110 = s69-s68   s111 = s107+s109 s112 = s107-s109
s113 = s108+s110 s114 = s111+s66

s115 = s77+s80   s116 = s77-s80   s117 = s79+s78
s118 = s79-s78   s119 = s115+s117 s120 = s115-s117
s121 = s116+s118 s122 = s119+s76

m0 = s98
m1 = ((cos(u)+cos(2*u))/2-1)*s95
m2 = ((cos(u)-cos(2*u))/2)*s96
m3 = j*(sin(u)+sin(2*u))*s92
m4 = j*sin(2*u)*s97
m5 = j*(sin(u)-sin(2*u))*s94
m6 = s106
m7 = ((cos(u)+cos(2*u))/2-1)*s103
m8 = ((cos(u)-cos(2*u))/2)*s104
m9 = j*(sin(u)+sin(2*u))*s100
m10 = j*sin(2*u)*s105
m11 = j*(sin(u)-sin(2*u))*s102
m12 = s114
m13 = ((cos(u)+cos(2*u))/2-1)*s111
m14 = ((cos(u)-cos(2*u))/2)*s112
m15 = j*(sin(u)+sin(2*u))*s108
m16 = j*sin(2*u)*s113
m17 = j*(sin(u)-sin(2*u))*s110
m18 = j*sin(v)*s122
m19 = j*sin(v)*((cos(u)+cos(2*u))/2-1)*s119
m20 = j*sin(v)*((cos(u)-cos(2*u))/2)*s120
m21 = j*sin(v)*j*(sin(u)+sin(2*u))*s116
m22 = j*sin(v)*j*sin(2*u)*s121
m23 = j*sin(v)*j*(sin(u)-sin(2*u))*s118

s123 = m0+m1     s124 = s123+m2   s125 = s123-m2
s126 = m3-m4     s127 = m4+m5     s128 = s124+s126
s129 = s124-s126 s130 = s125+s127 s131 = s125-s127

s132 = m6+m7     s133 = s132+m8   s134 = s132-m8
s135 = m9-m10    s136 = m10+m11   s137 = s133+s135
s138 = s133-s135 s139 = s134+s136 s140 = s134-s136

s141 = m12+m13   s142 = s141+m14  s143 = s141-m14
s144 = m15-m16   s145 = m16+m17   s146 = s142+s144
s147 = s142-s144 s148 = s143+s145 s149 = s143-s145

s150 = m18+m19   s151 = s150+m20  s152 = s150-m20
s153 = m21-m22   s154 = m22+m23   s155 = s151+s153
s156 = s151-s153 s157 = s152+s154 s158 = s152-s154

s159 = m12+m18   s164 = m12-m18
s160 = s146+s155 s165 = s146-s155
s161 = s148+s157 s166 = s148-s157
s162 = s149+s158 s167 = s149-s158
s163 = s147+s156 s168 = s147-s156

s169 = s0+s10    s174 = s0-s10    s179 = s5+s15
s170 = s16+s6    s175 = s16-s6    s180 = s1+s11
s171 = s12+s2    s176 = s12-s2    s181 = s17+s7
s172 = s8+s18    s177 = s8-s18    s182 = s13+s3
s173 = s4+s14    s178 = s4-s14    s183 = s9+s19
s184 = s5-s15    s189 = s169+s179 s194 = s169-s179
s185 = s1-s11    s190 = s170+s180 s195 = s170-s180
s186 = s17-s7    s191 = s171+s181 s196 = s171-s181
s187 = s13-s3    s192 = s172+s182 s197 = s172-s182
s188 = s9-s19    s193 = s173+s183 s198 = s173-s183

s199 = s190+s193 s200 = s190-s193 s201 = s192+s191
s202 = s192-s191 s203 = s199+s201 s204 = s199-s201
s205 = s200+s202 s206 = s203+s189

s207 = s195+s198 s208 = s195-s198 s209 = s197+s196
s210 = s197-s196 s211 = s207+s209 s212 = s207-s209
s213 = s208+s210 s214 = s211+s194

s215 = s175+s178 s216 = s175-s178 s217 = s177+s176
s218 = s177-s176 s219 = s215+s217 s220 = s215-s217
s221 = s216+s218 s222 = s219+s174

s223 = s185+s188 s224 = s185-s188 s225 = s187+s186
s226 = s187-s186 s227 = s223+s225 s228 = s223-s225
s229 = s224+s226 s230 = s227+s184

m24 = (cos(w)-1)*s206
m25 = (cos(w)-1)*((cos(u)+cos(2*u))/2-1)*s203
m26 = (cos(w)-1)*((cos(u)-cos(2*u))/2)*s204
m27 = (cos(w)-1)*j*(sin(u)+sin(2*u))*s200
m28 = (cos(w)-1)*j*sin(2*u)*s205
m29 = (cos(w)-1)*j*(sin(u)-sin(2*u))*s202
m30 = (cos(w)-1)*s214
m31 = (cos(w)-1)*((cos(u)+cos(2*u))/2-1)*s211
m32 = (cos(w)-1)*((cos(u)-cos(2*u))/2)*s212
m33 = (cos(w)-1)*j*(sin(u)+sin(2*u))*s208
m34 = (cos(w)-1)*j*sin(2*u)*s213
m35 = (cos(w)-1)*j*(sin(u)-sin(2*u))*s210
m36 = (cos(w)-1)*s222
m37 = (cos(w)-1)*((cos(u)+cos(2*u))/2-1)*s219
m38 = (cos(w)-1)*((cos(u)-cos(2*u))/2)*s220
m39 = (cos(w)-1)*j*(sin(u)+sin(2*u))*s216
m40 = (cos(w)-1)*j*sin(2*u)*s221
m41 = (cos(w)-1)*j*(sin(u)-sin(2*u))*s218
m42 = (cos(w)-1)*j*sin(v)*s230
m43 = (cos(w)-1)*j*sin(v)*((cos(u)+cos(2*u))/2-1)*s227
m44 = (cos(w)-1)*j*sin(v)*((cos(u)-cos(2*u))/2)*s228
m45 = (cos(w)-1)*j*sin(v)*j*(sin(u)+sin(2*u))*s224
m46 = (cos(w)-1)*j*sin(v)*j*sin(2*u)*s229
m47 = (cos(w)-1)*j*sin(v)*j*(sin(u)-sin(2*u))*s226

s231 = m24+m25   s232 = s231+m26  s233 = s231-m26
s234 = m27-m28   s235 = m28+m29   s236 = s232+s234
s237 = s232-s234 s238 = s233+s235 s239 = s233-s235

s240 = m30+m31   s241 = s240+m32  s242 = s240-m32
s243 = m33-m34   s244 = m34+m35   s245 = s241+s243
s246 = s241-s243 s247 = s242+s244 s248 = s242-s244

s249 = m36+m37   s250 = s249+m38  s251 = s249-m38
s252 = m39-m40   s253 = m40+m41   s254 = s250+s252
s255 = s250-s252 s256 = s251+s253 s257 = s251-s253

s258 = m42+m43   s259 = s258+m44  s260 = s258-m44
s261 = m45-m46   s262 = m46+m47   s263 = s259+s261
s264 = s259-s261 s265 = s260+s262 s266 = s260-s262

s267 = m36+m42   s272 = m36-m42
s268 = s254+s263 s273 = s254-s263
s269 = s256+s265 s274 = s256-s265
s270 = s257+s266 s275 = s257-s266
s271 = s255+s264 s276 = s255-s264

s277 = s20+s30   s282 = s20-s30   s287 = s25+s35
s278 = s36+s26   s283 = s36-s26   s288 = s21+s31
s279 = s32+s22   s284 = s32-s22   s289 = s37+s27
s280 = s28+s38   s285 = s28-s38   s290 = s33+s23
s281 = s24+s34   s286 = s24-s34   s291 = s29+s39
s292 = s25-s35   s297 = s277+s287 s302 = s277-s287
s293 = s21-s31   s298 = s278+s288 s303 = s278-s288
s294 = s37-s27   s299 = s279+s289 s304 = s279-s289
s295 = s33-s23   s300 = s280+s290 s305 = s280-s290
s296 = s29-s39   s301 = s281+s291 s306 = s281-s291

s307 = s298+s301 s308 = s298-s301 s309 = s300+s299
s310 = s300-s299 s311 = s307+s309 s312 = s307-s309
s313 = s308+s310 s314 = s311+s297

s315 = s303+s306 s316 = s303-s306 s317 = s305+s304
s318 = s305-s304 s319 = s315+s317 s320 = s315-s317
s321 = s316+s318 s322 = s319+s302

s323 = s283+s286 s324 = s283-s286 s325 = s285+s284
s326 = s285-s284 s327 = s323+s325 s328 = s323-s325
s329 = s324+s326 s330 = s327+s282

s331 = s293+s296 s332 = s293-s296 s333 = s295+s294
s334 = s295-s294 s335 = s331+s333 s336 = s331-s333
s337 = s332+s334 s338 = s335+s292

m48 = j*sin(w)*s314
m49 = j*sin(w)*((cos(u)+cos(2*u))/2-1)*s311
m50 = j*sin(w)*((cos(u)-cos(2*u))/2)*s312
m51 = j*sin(w)*j*(sin(u)+sin(2*u))*s308
m52 = j*sin(w)*j*sin(2*u)*s313
m53 = j*sin(w)*j*(sin(u)-sin(2*u))*s310
m54 = j*sin(w)*s322
m55 = j*sin(w)*((cos(u)+cos(2*u))/2-1)*s319
m56 = j*sin(w)*((cos(u)-cos(2*u))/2)*s320
m57 = j*sin(w)*j*(sin(u)+sin(2*u))*s316
m58 = j*sin(w)*j*sin(2*u)*s321
m59 = j*sin(w)*j*(sin(u)-sin(2*u))*s318
m60 = j*sin(w)*s330
m61 = j*sin(w)*((cos(u)+cos(2*u))/2-1)*s327
m62 = j*sin(w)*((cos(u)-cos(2*u))/2)*s328
m63 = j*sin(w)*j*(sin(u)+sin(2*u))*s324
m64 = j*sin(w)*j*sin(2*u)*s329
m65 = j*sin(w)*j*(sin(u)-sin(2*u))*s326
m66 = j*sin(w)*j*sin(v)*s338
m67 = j*sin(w)*j*sin(v)*((cos(u)+cos(2*u))/2-1)*s335
m68 = j*sin(w)*j*sin(v)*((cos(u)-cos(2*u))/2)*s336
m69 = j*sin(w)*j*sin(v)*j*(sin(u)+sin(2*u))*s332
m70 = j*sin(w)*j*sin(v)*j*sin(2*u)*s337
m71 = j*sin(w)*j*sin(v)*j*(sin(u)-sin(2*u))*s334


s339 = m48+m49   s340 = s339+m50  s341 = s339-m50
s342 = m51-m52   s343 = m52+m53   s344 = s340+s342
s345 = s340-s342 s346 = s341+s343 s347 = s341-s343

s348 = m54+m55   s349 = s348+m56  s350 = s348-m56
s351 = m57-m58   s352 = m58+m59   s353 = s349+s351
s354 = s349-s351 s355 = s350+s352 s356 = s350-s352

s357 = m60+m61   s358 = s357+m62  s359 = s357-m62
s360 = m63-m64   s361 = m64+m65   s362 = s358+s360
s363 = s358-s360 s364 = s359+s361 s365 = s359-s361

s366 = m66+m67   s367 = s366+m68  s368 = s366-m68
s369 = m69-m70   s370 = m70+m71   s371 = s367+s369
s372 = s367-s369 s373 = s368+s370 s374 = s368-s370

s375 = m60+m66   s380 = m60-m66
s376 = s362+s371 s381 = s362-s371
s377 = s364+s373 s382 = s364-s373
s378 = s365+s374 s383 = s365-s374
s379 = s363+s372 s384 = s363-s372

s385 = m0+m24    s405 = s385+m48  s425 = s385-m48
s386 = s160+s268 s406 = s386+s376 s426 = s386-s376
s387 = s139+s247 s407 = s387+s355 s427 = s387-s355
s388 = s167+s275 s408 = s388+s383 s428 = s388-s383
s389 = s129+s237 s409 = s389+s345 s429 = s389-s345
s390 = s159+s267 s410 = s390+s375 s430 = s390-s375
s391 = s137+s245 s411 = s391+s353 s431 = s391-s353
s392 = s166+s274 s412 = s392+s382 s432 = s392-s382
s393 = s131+s239 s413 = s393+s347 s433 = s393-s347
s394 = s163+s271 s414 = s394+s379 s434 = s394-s379
s395 = m6+m30    s415 = s395+m54  s435 = s395-m54
s396 = s165+s273 s416 = s396+s381 s436 = s396-s381
s397 = s130+s238 s417 = s397+s346 s437 = s397-s346
s398 = s162+s270 s418 = s398+s378 s438 = s398-s378
s399 = s138+s246 s419 = s399+s354 s439 = s399-s354
s400 = s164+s272 s420 = s400+s380 s440 = s400-s380
s401 = s128+s236 s421 = s401+s344 s441 = s401-s344
s402 = s161+s269 s422 = s402+s377 s442 = s402-s377
s403 = s140+s248 s423 = s403+s356 s443 = s403-s356
s404 = s168+s276 s424 = s404+s384 s444 = s404-s384

A0 = m0      A40 = s405   A20 = s425
A21 = s160   A1 = s406    A41 = s426
A42 = s139   A22 = s407   A2 = s427
A3 = s167    A43 = s408   A23 = s428
A24 = s129   A4 = s409    A44 = s429
A45 = s159   A25 = s410   A5 = s430
A6 = s137    A46 = s411   A26 = s431
A27 = s166   A7 = s412    A47 = s432
A48 = s131   A28 = s413   A8 = s433
A9 = s163    A49 = s414   A29 = s434
A30 = m6     A10 = s415   A50 = s435
A51 = s165   A31 = s416   A11 = s436
A12 = s130   A52 = s417   A32 = s437
A33 = s162   A13 = s418   A53 = s438
A54 = s138   A34 = s419   A14 = s439
A15 = s164   A55 = s420   A35 = s440
A36 = s128   A16 = s421   A56 = s441
A57 = s161   A37 = s422   A17 = s442
A18 = s140   A58 = s423   A38 = s443
A39 = s168   A19 = s424   A59 = s444
B.0 TWIDDLE FACTORS

Tables 6-13 of this appendix present the theoretical values and 12-bit approximations of
the twiddle factors in the sets TF0, TF1, TF2, and TF3. Each 12-bit value is encoded in
two's complement notation with a binary point after the third bit. The quantization error
does not exceed 0.076%.
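The encoding can be illustrated with a short sketch. It assumes round-to-nearest quantization to 9 fractional bits, which is an assumption rather than something the report states; the function name `quantize` is hypothetical.

```python
FRACTION_BITS = 9    # 12 bits in total, binary point after the third bit
WORD_BITS = 12

def quantize(value):
    """Round to the nearest multiple of 2**-9; return (bit string, decimal value)."""
    code = round(value * (1 << FRACTION_BITS))
    if not -(1 << (WORD_BITS - 1)) <= code < (1 << (WORD_BITS - 1)):
        raise ValueError("value does not fit in 12-bit two's complement")
    bits = format(code & ((1 << WORD_BITS) - 1), "012b")   # two's complement pattern
    return bits, code / (1 << FRACTION_BITS)

bits, stored = quantize(0.55901699437495)
relative_error = abs(stored - 0.55901699437495) / 0.55901699437495
print(bits, stored, relative_error)
```

For the value 0.55901699437495 this gives the pattern 000100011110 and the decimal 0.55859375, with a relative error just under the 0.076% bound quoted above.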

The bits of the twiddle factors TF1, TF2, and TF3 must be inverted before being
stored into the multipliers described in this document. The bits of TF0 need not be
inverted. The least significant bit must be stored into the first multiplier stage, where the
multiplicands enter, and the most significant bit into the last stage, where the product
exits.
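The storage rule can be sketched as follows, under the assumption that "inverted" means a bitwise complement of the 12-bit pattern; the report does not spell this out, and the function name is hypothetical.

```python
def multiplier_load_order(bits, invert):
    """Return the bits in stage order: LSB first, optionally complemented."""
    if invert:
        bits = "".join("1" if b == "0" else "0" for b in bits)
    return bits[::-1]   # index 0 is the first multiplier stage

# A TF0 entry is stored as is; a TF1 entry is complemented first (assumption).
tf0_stages = multiplier_load_order("000100011110", invert=False)
tf1_stages = multiplier_load_order("111011100010", invert=True)
print(tf0_stages, tf1_stages)
```

Character 0 of each returned string is the bit destined for the first multiplier stage and character 11 the bit for the last stage.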
Table 6: Twiddle factors in TF0 (real side).

No   Theoretical Value   Stored Value (12 bits, MSB to LSB)   Stored Value (decimal)
 0    1.00000000000000   (001000000000)                        1.000000000
 1   -1.25000000000000   (110110000000)                       -1.250000000
 2    0.55901699437495   (000100011110)                        0.558593750
 3    1.53884176858763   (001100010100)                        1.539062500
 4    0.58778525229247   (000100101101)                        0.587890625
 5    0.36327126400268   (000010111010)                        0.363281250
 6    1.00000000000000   (001000000000)                        1.000000000
 7   -1.25000000000000   (110110000000)                       -1.250000000
 8    0.55901699437495   (000100011110)                        0.558593750
 9    1.53884176858763   (001100010100)                        1.539062500
10    0.58778525229247   (000100101101)                        0.587890625
11    0.36327126400268   (000010111010)                        0.363281250
12    1.00000000000000   (001000000000)                        1.000000000
13   -1.25000000000000   (110110000000)                       -1.250000000
14    0.55901699437495   (000100011110)                        0.558593750
15    1.53884176858763   (001100010100)                        1.539062500
16    0.58778525229247   (000100101101)                        0.587890625
17    0.36327126400268   (000010111010)                        0.363281250
18   -1.00000000000000   (111000000000)                       -1.000000000
19    1.25000000000000   (001010000000)                        1.250000000
20   -0.55901699437495   (111011100010)                       -0.558593750
21    1.53884176858763   (001100010100)                        1.539062500
22    0.58778525229247   (000100101101)                        0.587890625
23    0.36327126400268   (000010111010)                        0.363281250

Table 7: Twiddle factors in TF0 (imaginary side).

No   Theoretical Value   Stored Value (12 bits, MSB to LSB)   Stored Value (decimal)
 0    1.00000000000000   (001000000000)                        1.000000000
 1   -1.25000000000000   (110110000000)                       -1.250000000
 2    0.55901699437495   (000100011110)                        0.558593750
 3   -1.53884176858763   (110011101100)                       -1.539062500
 4   -0.58778525229247   (111011010011)                       -0.587890625
 5   -0.36327126400268   (111101000110)                       -0.363281250
 6    1.00000000000000   (001000000000)                        1.000000000
 7   -1.25000000000000   (110110000000)                       -1.250000000
 8    0.55901699437495   (000100011110)                        0.558593750
 9   -1.53884176858763   (110011101100)                       -1.539062500
10   -0.58778525229247   (111011010011)                       -0.587890625
11   -0.36327126400268   (111101000110)                       -0.363281250
12    1.00000000000000   (001000000000)                        1.000000000
13   -1.25000000000000   (110110000000)                       -1.250000000
14    0.55901699437495   (000100011110)                        0.558593750
15   -1.53884176858763   (110011101100)                       -1.539062500
16   -0.58778525229247   (111011010011)                       -0.587890625
17   -0.36327126400268   (111101000110)                       -0.363281250
18    1.00000000000000   (001000000000)                        1.000000000
19   -1.25000000000000   (110110000000)                       -1.250000000
20    0.55901699437495   (000100011110)                        0.558593750
21    1.53884176858763   (001100010100)                        1.539062500
22    0.58778525229247   (000100101101)                        0.587890625
23    0.36327126400268   (000010111010)                        0.363281250

Table 8: Twiddle factors in TF1 (real side).

No   Theoretical Value   Stored Value (12 bits, MSB to LSB)   Stored Value (decimal)
 0    1.00000000000000   (001000000000)                        1.000000000
 1   -1.25000000000000   (110110000000)                       -1.250000000
 2   -0.55901699437495   (111011100010)                       -0.558593750
 3   -0.36327126400268   (111101000110)                       -0.363281250
 4   -0.95105651629515   (111000011001)                       -0.951171875
 5    1.53884176858762   (001100010100)                        1.539062500
 6    1.00000000000000   (001000000000)                        1.000000000
 7   -1.25000000000000   (110110000000)                       -1.250000000
 8   -0.55901699437495   (111011100010)                       -0.558593750
 9   -0.36327126400268   (111101000110)                       -0.363281250
10   -0.95105651629515   (111000011001)                       -0.951171875
11    1.53884176858762   (001100010100)                        1.539062500
12    1.00000000000000   (001000000000)                        1.000000000
13   -1.25000000000000   (110110000000)                       -1.250000000
14   -0.55901699437495   (111011100010)                       -0.558593750
15   -0.36327126400268   (111101000110)                       -0.363281250
16   -0.95105651629515   (111000011001)                       -0.951171875
17    1.53884176858762   (001100010100)                        1.539062500
18    1.00000000000000   (001000000000)                        1.000000000
19   -1.25000000000000   (110110000000)                       -1.250000000
20   -0.55901699437495   (111011100010)                       -0.558593750
21    0.36327126400268   (000010111010)                        0.363281250
22    0.95105651629515   (000111100111)                        0.951171875
23   -1.53884176858762   (110011101100)                       -1.539062500

Table 9: Twiddle factors in TF1 (imaginary side).

No   Theoretical Value   Stored Value (12 bits, MSB to LSB)   Stored Value (decimal)
 0    1.00000000000000   (001000000000)                        1.000000000
 1   -1.25000000000000   (110110000000)                       -1.250000000
 2   -0.55901699437495   (111011100010)                       -0.558593750
 3    0.36327126400268   (000010111010)                        0.363281250
 4    0.95105651629515   (000111100111)                        0.951171875
 5   -1.53884176858762   (110011101100)                       -1.539062500
 6    1.00000000000000   (001000000000)                        1.000000000
 7   -1.25000000000000   (110110000000)                       -1.250000000
 8   -0.55901699437495   (111011100010)                       -0.558593750
 9    0.36327126400268   (000010111010)                        0.363281250
10    0.95105651629515   (000111100111)                        0.951171875
11   -1.53884176858762   (110011101100)                       -1.539062500
12    1.00000000000000   (001000000000)                        1.000000000
13   -1.25000000000000   (110110000000)                       -1.250000000
14   -0.55901699437495   (111011100010)                       -0.558593750
15    0.36327126400268   (000010111010)                        0.363281250
16    0.95105651629515   (000111100111)                        0.951171875
17   -1.53884176858762   (110011101100)                       -1.539062500
18   -1.00000000000000   (111000000000)                       -1.000000000
19    1.25000000000000   (001010000000)                        1.250000000
20    0.55901699437495   (000100011110)                        0.558593750
21    0.36327126400268   (000010111010)                        0.363281250
22    0.95105651629515   (000111100111)                        0.951171875
23   -1.53884176858762   (110011101100)                       -1.539062500

Table 10: Twiddle factors in TF2 (real side).

No   Theoretical Value   Stored Value (12 bits, MSB to LSB)   Stored Value (decimal)
 0   -1.50000000000000   (110100000000)                       -1.500000000
 1    1.87500000000000   (001111000000)                        1.875000000
 2    0.83852549156242   (000110101101)                        0.837890625
 3    0.54490689600402   (000100010111)                        0.544921875
 4    1.42658477444273   (001011011010)                        1.425781250
 5   -2.30826265288144   (101101100010)                       -2.308593750
 6   -1.50000000000000   (110100000000)                       -1.500000000
 7    1.87500000000000   (001111000000)                        1.875000000
 8    0.83852549156242   (000110101101)                        0.837890625
 9    0.54490689600402   (000100010111)                        0.544921875
10    1.42658477444273   (001011011010)                        1.425781250
11   -2.30826265288144   (101101100010)                       -2.308593750
12   -1.50000000000000   (110100000000)                       -1.500000000
13    1.87500000000000   (001111000000)                        1.875000000
14    0.83852549156242   (000110101101)                        0.837890625
15    0.54490689600402   (000100010111)                        0.544921875
16    1.42658477444273   (001011011010)                        1.425781250
17   -2.30826265288144   (101101100010)                       -2.308593750
18   -1.50000000000000   (110100000000)                       -1.500000000
19    1.87500000000000   (001111000000)                        1.875000000
20    0.83852549156242   (000110101101)                        0.837890625
21   -0.54490689600402   (111011101001)                       -0.544921875
22   -1.42658477444273   (110100100110)                       -1.425781250
23    2.30826265288144   (010010011110)                        2.308593750

Table 11: Twiddle factors in TF2 (imaginary side).

No   Theoretical Value   Stored Value (12 bits, MSB to LSB)   Stored Value (decimal)
 0   -1.50000000000000   (110100000000)                       -1.500000000
 1    1.87500000000000   (001111000000)                        1.875000000
 2    0.83852549156242   (000110101101)                        0.837890625
 3   -0.54490689600402   (111011101001)                       -0.544921875
 4   -1.42658477444273   (110100100110)                       -1.425781250
 5    2.30826265288144   (010010011110)                        2.308593750
 6   -1.50000000000000   (110100000000)                       -1.500000000
 7    1.87500000000000   (001111000000)                        1.875000000
 8    0.83852549156242   (000110101101)                        0.837890625
 9   -0.54490689600402   (111011101001)                       -0.544921875
10   -1.42658477444273   (110100100110)                       -1.425781250
11    2.30826265288144   (010010011110)                        2.308593750
12   -1.50000000000000   (110100000000)                       -1.500000000
13    1.87500000000000   (001111000000)                        1.875000000
14    0.83852549156242   (000110101101)                        0.837890625
15   -0.54490689600402   (111011101001)                       -0.544921875
16   -1.42658477444273   (110100100110)                       -1.425781250
17    2.30826265288144   (010010011110)                        2.308593750
18    1.50000000000000   (001100000000)                        1.500000000
19   -1.87500000000000   (110001000000)                       -1.875000000
20   -0.83852549156242   (111001010011)                       -0.837890625
21   -0.54490689600402   (111011101001)                       -0.544921875
22   -1.42658477444273   (110100100110)                       -1.425781250
23    2.30826265288144   (010010011110)                        2.308593750

Table 12: Twiddle factors in TF3 (real side).

No   Theoretical Value   Stored Value (12 bits, MSB to LSB)   Stored Value (decimal)
 0   -0.86602540378444   (111001000101)                       -0.865234375
 1    1.08253175473055   (001000101010)                        1.082031250
 2    0.48412291827593   (000011111000)                        0.484375000
 3    0.31460214309120   (000010100001)                        0.314453125
 4    0.82363910354633   (000110100110)                        0.824218750
 5   -1.33267606400146   (110101010110)                       -1.332031250
 6   -0.86602540378444   (111001000101)                       -0.865234375
 7    1.08253175473055   (001000101010)                        1.082031250
 8    0.48412291827593   (000011111000)                        0.484375000
 9    0.31460214309120   (000010100001)                        0.314453125
10    0.82363910354633   (000110100110)                        0.824218750
11   -1.33267606400146   (110101010110)                       -1.332031250
12   -0.86602540378444   (111001000101)                       -0.865234375
13    1.08253175473055   (001000101010)                        1.082031250
14    0.48412291827593   (000011111000)                        0.484375000
15    0.31460214309120   (000010100001)                        0.314453125
16    0.82363910354633   (000110100110)                        0.824218750
17   -1.33267606400146   (110101010110)                       -1.332031250
18   -0.86602540378444   (111001000101)                       -0.865234375
19    1.08253175473055   (001000101010)                        1.082031250
20    0.48412291827593   (000011111000)                        0.484375000
21   -0.31460214309120   (111101011111)                       -0.314453125
22   -0.82363910354633   (111001011010)                       -0.824218750
23    1.33267606400146   (001010101010)                        1.332031250


Table 13: Twiddle factors in TF3 (imaginary side).


No    Theoretical Value    Stored Value
                           12 Bits           Decimal
                           MSB        LSB

 0     0.86602540378444    (000110111011)     0.865234375
 1    -1.08253175473055    (110111010110)    -1.082031250
 2    -0.48412291827593    (111100001000)    -0.484375000
 3     0.31460214309120    (000010100001)     0.314453125
 4     0.82363910354633    (000110100110)     0.824218750
 5    -1.33267606400146    (110101010110)    -1.332031250
 6     0.86602540378444    (000110111011)     0.865234375
 7    -1.08253175473055    (110111010110)    -1.082031250
 8    -0.48412291827593    (111100001000)    -0.484375000
 9     0.31460214309120    (000010100001)     0.314453125
10     0.82363910354633    (000110100110)     0.824218750
11    -1.33267606400146    (110101010110)    -1.332031250
12     0.86602540378444    (000110111011)     0.865234375
13    -1.08253175473055    (110111010110)    -1.082031250
14    -0.48412291827593    (111100001000)    -0.484375000
15     0.31460214309120    (000010100001)     0.314453125
16     0.82363910354633    (000110100110)     0.824218750
17    -1.33267606400146    (110101010110)    -1.332031250
18    -0.86602540378444    (111001000101)    -0.865234375
19     1.08253175473055    (001000101010)     1.082031250
20     0.48412291827593    (000011111000)     0.484375000
21     0.31460214309120    (000010100001)     0.314453125
22     0.82363910354633    (000110100110)     0.824218750
23    -1.33267606400146    (110101010110)    -1.332031250
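
The stored values in Tables 12 and 13 appear consistent with rounding each theoretical coefficient to the nearest multiple of 2^-9 and writing the result as a 12-bit two's-complement word (one sign bit, two integer bits, nine fractional bits). This word format is inferred from the tabulated bit patterns and is not stated in the tables themselves. The function name `quantize_coefficient` and its parameters are illustrative only. A minimal Python sketch under that assumption:

```python
def quantize_coefficient(value, total_bits=12, frac_bits=9):
    """Round `value` to a fixed-point two's-complement word.

    Assumption (inferred from the tables, not stated in them): 12-bit
    words with 9 fractional bits, rounded to the nearest code.
    Returns (bit string MSB first, decimal value actually stored).
    """
    scale = 1 << frac_bits                      # 2**9 = 512 steps per unit
    code = round(value * scale)                 # nearest integer code
    lowest = -(1 << (total_bits - 1))           # -2048 for 12 bits
    highest = (1 << (total_bits - 1)) - 1       # +2047 for 12 bits
    if not lowest <= code <= highest:
        raise ValueError("coefficient does not fit in the word length")
    # Masking maps a negative code onto its two's-complement bit pattern.
    bits = format(code & ((1 << total_bits) - 1), "0{}b".format(total_bits))
    return bits, code / scale


if __name__ == "__main__":
    for theoretical in (-0.86602540378444, 1.08253175473055, 0.31460214309120):
        bits, stored = quantize_coefficient(theoretical)
        print("{:18.14f}  ({})  {:.9f}".format(theoretical, bits, stored))
```

The difference between a theoretical value and its stored value is then at most half a least significant bit, that is 2^-10, provided the coefficient lies inside the representable range.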
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C.0 LOGIC SYMBOLS


[Figure 25 shows the symbols for: inverter; 2-input And, Nand, Or, Nor and Xor gates; 3-input Nand and Xor gates; half adder; full adder; flip-flop; 2-to-1 Mux, non-inverting; 2-to-1 Mux, inverting; 3-to-1 Mux, inverting; 4-to-1 Mux, non-inverting; addition; subtraction; 20-point transformation.]

Figure 25: Logic symbols used throughout this document.
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